Clova Music: DJ AI. Adrian Kim, M.S., Clova AI Research (CLAIR), Naver Corp. PowerPoint PPT Presentation


SLIDE 1

Clova Music: An AI assistant like a smart DJ (똑똑한 DJ같은 AI비서)

김정명 (Adrian Kim), M.S., Clova AI Research (CLAIR), Naver Corp.

SLIDE 2

Clova: Cloud-based Virtual Assistant

General Purpose AI platform

SLIDE 3

Clova: Cloud-based Virtual Assistant

https://clova.ai

SLIDE 4

Clova: Cloud-based Virtual Assistant

SLIDE 5

Clova Music

  • The biggest need from a smart speaker would be MUSIC

= Music Listening Platform?

SLIDE 6

Clova Music

  • Intelligent music recommendation service of Clova
  • Aims to be a human DJ-like curator
  • Powered with NAVER/LINE music user/content data
SLIDE 7

Contents

  • Part 1 Short Tutorial on Music modeling
  • What kind of data do we use?
  • What kind of models can we use?
  • What kind of problems can we solve?
  • Any industry research?
  • Part 2 Music Research in Clova
  • Recommendation Systems
  • Representation learning
  • Emotion recognition
  • Highlight extraction
  • Automatic DJ list generation
SLIDE 8

Introducing the Music Domain

SLIDE 9

Popular Domains...

SLIDE 10

Audio domain data


SLIDE 11

Wave

  • Basic data form is 16-bit integers
  • You can normalize to [-1, 1]
  • 1D vector of samples
  • Sample rates: 16 kHz, 22050 Hz, ...
  • Very information-inefficient

At 16 kHz, 30 seconds = 480k data points!
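To make the bookkeeping concrete, here is a minimal NumPy sketch; the 440 Hz sine is just a synthetic stand-in for real audio:

```python
import numpy as np

# Hypothetical 30-second clip at 16 kHz, stored as 16-bit integers
# (values in [-32768, 32767]), as described on the slide.
sr = 16_000
t = np.arange(sr * 30)
wave_int16 = (np.sin(2 * np.pi * 440 * t / sr) * 32767).astype(np.int16)

# Normalize to floats in [-1, 1] for model input.
wave = wave_int16.astype(np.float32) / 32768.0

print(wave.shape)  # (480000,) -> 480k samples for just 30 seconds of audio
```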

SLIDE 12

Spectrograms

Expressive, has more information!

SLIDE 13

Mel-spectrograms

Mel filter banks compress the >1k linear frequency bins down to 80, 96, or 128 mel bins.
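A minimal sketch of how a triangular mel filter bank collapses linear frequency bins into mel bins. The exact filter shapes and normalization differ across libraries; this is an illustrative construction, not any particular library's implementation:

```python
import numpy as np

def mel_filterbank(n_mels=128, n_fft=2048, sr=22050):
    """Triangular mel filters mapping n_fft//2+1 linear bins to n_mels bins."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    # n_mels + 2 equally spaced points on the mel scale define the triangles.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(l, c):          # rising edge of the triangle
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling edge of the triangle
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
spec = np.abs(np.random.randn(1025, 100))  # fake magnitude spectrogram
mel_spec = fb @ spec                        # (128, 100): 1025 bins -> 128 mel bins
print(fb.shape, mel_spec.shape)
```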

SLIDE 14

Mel-spectrograms

  • Mel-spectrogram filter distributions give relative focus on lower frequency bins

Image from Choi et al. '16

SLIDE 15

Transformation between data types

To go back from a spectrogram to audio: if the complex STFT is kept, apply the inverse STFT; if only the magnitude is kept, use the Griffin-Lim algorithm. The STFT → mel filter bank step is irreversible, but a WaveNet vocoder can learn to invert it (Shen et al. '17). Shapes for a ~30 s track: wave (1323000,) → spectrogram (1025, 2584) = 2,648,600 values → mel-spectrogram (128, 2584) = 330,752 values.
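The element counts behind the slide's shape annotations can be checked directly (the ~30 s / 44.1 kHz / n_fft = 2048 framing is our assumption, chosen to match the shapes shown):

```python
# Element counts for the three representations of the same ~30 s track.
wave_elems = 1_323_000        # (1323000,)  raw samples
spec_elems = 1025 * 2584      # (1025, 2584) magnitude spectrogram
mel_elems = 128 * 2584        # (128, 2584)  mel-spectrogram

print(spec_elems)             # 2648600
print(mel_elems)              # 330752
# The mel-spectrogram is roughly 4x more compact than the raw wave,
# and 8x more compact than the full spectrogram.
```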

SLIDE 16

Audio data issues:

  • Too large: storage and memory problems
  • Dirty labels: low quality, weakly labeled (Choi et al. 2017)
  • Must hear to evaluate: takes a lot of time for high quality
  • Convoluted: multiple sources mixed together
  • Low efficiency: information per data point is very small
  • Not much open data

SLIDE 17

Comparing Simple Tasks

               MNIST         GTZAN
Storage        45 MB         1.2 GB
Data pairs     60,000        1,000 (30-second clips)
Classes        10 digits     10 genres (100 each)
Preprocessing  Fast          Slow
Testing        Easy          Hard

SLIDE 18

Comparing Speech and Music

News speech audio: short, single source. Music (Bad Boy – Red Velvet): long, multiple sources.

SLIDE 19

Example Baselines

SLIDE 20

What kind of problems can we solve?

  • Genre/Artist Classification
  • Automatic Tagging
  • Music generation
  • Style transfer
  • Source separation
  • Onset detection
  • Sound embedding
  • Beat tracking
  • and more...!
[Architecture diagram: convolution and pooling layers with channel summation feed stacked LSTMs; a softmax attention layer weights the LSTM outputs via element-wise multiplication, followed by a fully connected layer.]

SLIDE 21

Autotagging with Convnets

  • Input: mel-spectrogram (MSD dataset)
  • Output: Tags (50 top tags)

Automatic tagging using deep convolutional neural networks, Choi et al., ISMIR 16

https://github.com/keunwoochoi/music-auto_tagging-keras


SLIDE 22

Note: Filter design in CNNs

2D convs: n×m filters on 1 channel; slower training; captures local structure in frequency. 1D convs: n×1 filters over m channels; faster training; treats frequency bins as discrete features.
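A shape-level sketch of the two filter designs on a mel-spectrogram. Naive loops stand in for a real convolution library, purely to make the tensor shapes visible:

```python
import numpy as np

mel = np.random.randn(96, 1000)  # (mel bins, time frames)

# 2D conv view: one input channel of shape (96, 1000); a 3x3 filter
# slides over both frequency and time (local structure in frequency).
x2d = mel[None, :, :]            # (channels=1, freq, time)
w2d = np.random.randn(1, 3, 3)   # 1 input channel, 3x3 kernel

# 1D conv view: each of the 96 mel bins becomes an input channel; a
# length-3 filter slides over time only (frequencies treated as discrete).
x1d = mel                        # (channels=96, time)
w1d = np.random.randn(96, 3)     # 96 input channels, kernel size 3

# One output feature map per view (valid convolution, a single filter):
out2d = np.array([[(x2d[0, i:i+3, j:j+3] * w2d[0]).sum()
                   for j in range(998)] for i in range(94)])
out1d = np.array([(x1d[:, j:j+3] * w1d).sum() for j in range(998)])
print(out2d.shape, out1d.shape)  # (94, 998) vs (998,)
```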

SLIDE 23

Auto Music Transcription with Deep Complex Networks

  • Input: complex spectrogram; output: complex values
  • Network components (batchnorm, initialization, activations, convolutions) changed to match the complex domain

Baseline ("real"): real and imaginary values as separate channels; "complex": as the paper suggests.

Deep Complex Networks, Trabelsi et al., To appear at ICLR18

SLIDE 24

WaveNet for TTS

  • Input: wav format data

Image from https://kakalabblog.wordpress.com/2017/07/18/wavenetnsynth-deep-audio-generative-models/

WaveNet: A Generative Model for Raw Audio, Oord et al., https://arxiv.org/pdf/1609.03499.pdf

SLIDE 25

Industries focusing on Music Research

and more!

SLIDE 26

NSynth: Encoding sounds with Wavenet Autoencoder

https://magenta.tensorflow.org/nsynth

  • WaveNet-based model made by Magenta to produce a neural synthesizer
  • Latent embeddings (Z) from various sounds made by the model can be used to produce new sounds

  • New dataset with instrument, pitch, etc. tags on individual sounds
SLIDE 27

Performance RNN

  • Trained on the Yamaha e-Piano Competition dataset
  • MIDI of 1,400+ piano performances
  • Magenta used LSTMs to predict among 388 event types occurring along the timeline

Generated example

https://magenta.tensorflow.org/performance-rnn

SLIDE 28

Discover Weekly

  • Spotify’s weekly personalized recommendation service
  • Collaborative Filtering
  • NLP modeling
  • Audio modeling

http://benanne.github.io/2014/08/05/spotify-cnns.html#contentbased http://blog.galvanize.com/spotify-discover-weekly-data-science/

SLIDE 29

Any questions?

  • On to Part 2...
SLIDE 30

Clova Music Recommendation System

SLIDE 31

Recommendation in Clova Music

  • User logs are the main data; a hybrid with content data is possible
  • Large and sparse online data
  • Topics:
  • User log analysis
  • Music semantic embedding learning
  • Collaborative filtering with matrix factorization
SLIDE 32

Top queries with Music

  • 노래 틀어줘 (Play a song)
  • 자장가 틀어줘 (Play a lullaby)
  • 동요 틀어줘 (Play children's songs)
  • 신나는 노래 틀어줘 (Play an upbeat song)
  • 조용한 노래 틀어줘 (Play a quiet song)
  • 핑크퐁 노래 틀어줘 (Play Pinkfong songs)
  • 아이유 노래 틀어줘 (Play IU songs)
  • 클래식 틀어줘 (Play classical music)
  • 분위기 좋은 음악 틀어줘 (Play music with a good mood)
  • 잔잔한 음악 틀어줘 (Play calm music)
  • 발라드 틀어줘 (Play ballads)
  • Artists > Tracks
  • Genre, mood, themes > Artists
  • JUST PLAY > Genres

* Reported as of Oct. 2017.

SLIDE 33

Device Usage Patterns

[Chart: usage by hour of day across NAVER_APP, NAVER_PC, WAVE, and CLOVA_APP]


SLIDE 34

[Chart: genre share on NAVER_APP vs. WAVE: K-pop (가요), functional music, pop, children's songs (동요), OST, classical, jazz, religious music, 일렉트로… (electronic), rock, hip-hop, other]

Device Usage Patterns


SLIDE 35

Device Usage Patterns


[Chart: playing ratio vs. artist rank]

  • Artists / play-count ratio
  • Long-tail distribution
  • The distribution itself is not so different across devices...

SLIDE 36

Device Usage Patterns


Playing ratio by artist: WAVE vs. NAVER MUSIC APP

[Table: top-played artists per device. The WAVE speaker skews toward kids' and ambient content (핑크퐁/Pinkfong, 동요/children's songs, 자장가/lullabies, 오르골뮤직/music box, 힐링피아노/healing piano, 이루마, 아이유/IU), while the NAVER MUSIC APP skews toward idol and pop acts (EXO, 방탄소년단/BTS, 뉴이스트/NU`EST, Wanna One, 젝스키스, 윤종신, 볼빨간사춘기, 헤이즈, 선미, WINNER).]

SLIDE 37

Implication

  • Paradigm shift in terms of music consumption on AI speaker devices
  • New market
  • Kids, New parents
  • Lean-out music, lounge music
  • Classic, Jazz
  • Music Recommendation takes an important role on AI assistant platforms
SLIDE 38

Recommendation Challenges

  • Lack of well-defined metadata
  • Personalized playlists
  • Musical semantic embedding
  • Multimodal semantic embedding

SLIDE 39

Music Semantic Embedding

  • Mapping tracks, artists, and words to the same embedding space
  • Word2Vec
  • Feature learning
  • Usages
  • Item similarities
  • Used as features

[Embedding-space illustration: words like 가을 (autumn) and 신나는 (upbeat) land near related tracks and artists.]

SLIDE 40

Word2Vec with Tagged playlists

  • JAMM playlists
  • User-created playlists in Naver Music
  • About 72,000 playlists
  • Keywords from tags
  • Artists from tracks
  • Treat trackIds as "words" within a playlist
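A sketch of the "trackIds as words" idea: generating the (target, context) training pairs a word2vec-style model would consume. The playlist contents and track IDs here are made up:

```python
# Toy playlists: track IDs treated as "words" in a document, as on the slide.
playlists = [
    ["trk_11", "trk_42", "trk_7", "trk_42"],
    ["trk_7", "trk_99", "trk_11"],
]

def skipgram_pairs(playlists, window=2):
    """(target, context) pairs a skip-gram model would train on."""
    pairs = []
    for pl in playlists:
        for i, target in enumerate(pl):
            for j in range(max(0, i - window), min(len(pl), i + window + 1)):
                if j != i:
                    pairs.append((target, pl[j]))
    return pairs

pairs = skipgram_pairs(playlists)
print(pairs[:3])  # co-occurring tracks become each other's context
```

Tracks that co-occur in many playlists end up with similar embeddings, which is what makes item-similarity queries work.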


SLIDE 41

That song in the charts

  • 벚꽃엔딩 (Cherry Blossom Ending) / 버스커버스커 (Busker Busker)


SLIDE 42

Multimodal Semantic Embedding

  • We would want to model different playlists for different personalities
  • Query: 밤편지 ("Through the Night")

[Two alternative personalized playlists, 밤편지_1 and 밤편지_2, generated for the same query]

SLIDE 43

Embedding with session data

  • User play sequences as documents!
  • We use multimodal word distributions formed from mixtures of Gaussians

Ben Athiwaratkun and Andrew Gordon Wilson, Multimodal Word Distributions, 2017


SLIDE 44

Collaborative Filtering

Most popular method: Matrix Factorization

SLIDE 45

Matrix Factorization for Personalized Recommendation

  • Basic MF objective
  • Select tracks and artists that the user prefers when generating a playlist
  • Simple, but hard to apply:
  • Sparsity
  • Overfitting / underfitting
  • Hard to evaluate (needs real feedback, not RMSE!)
  • Combining with other models
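A minimal sketch of the basic MF objective on a toy play-count matrix, trained with SGD over observed entries only. The toy matrix, rank, and hyperparameters are illustrative, not the production setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy play-count matrix: rows = users, cols = tracks (0 = never played).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [0, 0, 5, 4],
              [0, 1, 4, 4]], dtype=float)

k, lr, reg = 2, 0.01, 0.02
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # item factors

# SGD on observed entries only: minimize (R_ui - U_u . V_i)^2 + reg * norms.
obs = [(u, i) for u in range(R.shape[0]) for i in range(R.shape[1]) if R[u, i] > 0]
for _ in range(5000):
    u, i = obs[rng.integers(len(obs))]
    err = R[u, i] - U[u] @ V[i]
    u_row = U[u].copy()
    U[u] += lr * (err * V[i] - reg * U[u])
    V[i] += lr * (err * u_row - reg * V[i])

pred = U @ V.T   # dense score matrix; unobserved cells become candidates
print(pred.round(1))
```

Note how sparsity bites even here: most cells of `R` are unobserved, and the regularizer is what keeps the factors from overfitting the few observed ones.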


SLIDE 46

What can we do?

  • Learning in 2 phases
  • Long term: batch learning
  • Short term: online learning
  • Negative sampling
  • When doing negative sampling, consider item distribution
  • Remove abusive users
  • Over-clicking users
  • Top-100-only users
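One way to "consider the item distribution" when negative sampling is to draw negatives from smoothed popularity counts rather than uniformly; the 0.75 exponent below is the word2vec heuristic, assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

play_counts = np.array([5000, 1200, 300, 40, 10], dtype=float)  # items 0..4

# Smoothed popularity distribution: count ** 0.75, renormalized. A popular
# item a user never played is stronger negative evidence than a tail item.
p = play_counts ** 0.75
p /= p.sum()

negatives = rng.choice(len(play_counts), size=10_000, p=p)
counts = np.bincount(negatives, minlength=5)
print(counts)  # popular items sampled more often, but the tail still appears
```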


SLIDE 47

Remaining Challenges

  • Conventional problems
  • Sparsity
  • Top 100 songs
  • Cold-start problems
  • Explanatory recommendation
  • Music Recommendation for AI Speakers
  • Interaction
  • Lean-in / Lean-back
  • Personalizing level (Familiar vs New)
SLIDE 48

Music Modeling

SLIDE 49

Music Modeling

  • Audio data as main data
  • Topics:
  • Representation Vector Extraction (Park et al. 17)
  • Music Emotion Recognition (Jeon et al. 17)
  • Music Highlight Extraction (Ha et al. 17)
  • Automatic DJ mix Generation (Kim et al. 17)
SLIDE 50

Extracting representation vectors by Artist Classification

  • Artist labels are complete, reliable annotations for music: very rare!
  • Any data can be used

DeepArtistID / ArtistNet: using more artists increases the representation quality.

Representation Learning Using Artist Labels for Audio Classification Tasks, Park et al., MIREX 17 Challenge; Representation Learning of Music Using Artist Labels, Park et al., https://arxiv.org/abs/1710.06648

SLIDE 51

Using extracted features

  • Transfer learning for many tasks such as genre and mood classification
  • 1st place in the MIREX 17 mood prediction challenge


SLIDE 52

Visualization


SLIDE 53

Querying (artist level)


SLIDE 54

Querying (song level)

  • bob marley and the wailers – three little birds
  • dennis brown – tribulation
  • junior murvin – police and thieves
  • norah jones – don't know why
  • dionne warwick – walk on by
  • jewel – enter from the east


SLIDE 55

Music Emotion Recognition

Music Emotion Recognition via End-to-End Multimodal Neural Networks, Jeon et al., RecSys 17

SLIDE 56


Multimodal approach

  • Task
  • Predict a given track's polarity (positive / negative)
  • Data: Naver Music Polarity Emotion Dataset
  • 7,484 tracks (pos : neg = 1 : 1)
  • Polarity emotion label
  • Lyrics: (27496, 400) word vectors
  • Mel-spectrograms (128 mel bins, 0.06 s per frame)


SLIDE 57

Results

  • Using both lyrics AND audio improves accuracy considerably.

Data    Model        Accuracy
Audio   CNN          0.6479
Audio   RNN          0.6303
Audio   MCRN         0.6619
Lyrics  CNN          0.7815
Lyrics  RNN          0.7716
Both    MCRN + CNN   0.8046


SLIDE 58

Lyric Analysis


SLIDE 59

Automatic Highlight Extraction

  • Original highlights in Naver Music: the first 1 minute of each track
  • If there are no explicit highlights given, how can we service better highlights?

Music Highlight Extraction via Convolutional Recurrent Attention Networks, Ha et al., ICML17 ML4MD Workshop

SLIDE 60
Why highlights?

  • Improve user experience and recommendations
  • Increase in potential customers (mp3 sales)
  • Valuable dataset

SLIDE 61

Formulation

  • Finding significant snippets within a track is an interesting and valuable task
  • Given a track x, where should a ‘Highlight’ start?

Input: mel-spectrogram of x (e.g. 4000 frames). Output: starting frame H of the highlight, which spans frames H to H+S for a highlight of size S.


SLIDE 62
Framework

  • 1. Train a deep learning model (CRAN) with attention
  • 2. At inference, use the attention values to compute the highlight position

SLIDE 63

Data

  • 32,083 full tracks with multi-label tags
  • Chosen based on play counts (Dec 16 ~ Jan 17)
  • 10 genres
  • Korea's music market is heavily biased toward specific genres!

[Chart: genre distribution for all tracks vs. popular (top 10%) vs. newly released (top 10%)]

SLIDE 64

CRAN (Convolutional Recurrent Attention Network)

[Architecture diagram: convolutional layer outputs feed recurrent layers with attention; trained with a genre classification objective.]

SLIDE 65

Training evaluation results

  • Using RNNs and adding attention gives better test results

[Chart: evaluation curves comparing CNN, CRNN, C-HiEx, and R-HiEx]


SLIDE 66

Computing the Highlight

Compute an attention-weighted energy for each frame (attention × mel energy), take the cumulative sum of S energy values starting from each candidate frame n, add the speed/acceleration of the energy change, and combine these into a highlight score. The frame with the best score is where the highlight of length S starts.
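A simplified sketch of this scoring step: it keeps the attention-weighted energy and the sum over S frames, and omits the speed/acceleration terms. The attention and energy arrays are random stand-ins for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CRAN outputs: per-frame attention weights and per-frame
# mel energy (e.g. column sums of the mel-spectrogram).
n_frames, S = 4000, 1000             # S = highlight length in frames
attention = rng.random(n_frames)
energy = rng.random(n_frames)

weighted = attention * energy         # attention-weighted energy per frame
# Score each candidate start frame n by the sum of S weighted values.
window = np.convolve(weighted, np.ones(S), mode="valid")  # length n_frames-S+1
H = int(np.argmax(window))            # chosen highlight start frame
print(H, H + S)                       # highlight spans frames [H, H+S)
```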


SLIDE 67

Evaluating Highlights

  • Our model has the most overlap with the true labels
  • It also has the best qualitative score


SLIDE 68

Highlight examples

  • 30 second clips automatically extracted
  • Sugar / Maroon 5 -
  • Faded / Alan Walker –
  • This Is What You Came For / Calvin Harris -
  • Lucky Strike / Maroon 5 –
  • Dream Girls / I.O.I –
  • Napal Baji / Psy -
SLIDE 69

Example application

  • What is a DJ mix?

A naturally mixed sequence of music clips played as a single song

Automatic DJ mix generation using highlight detection, Kim et al., ISMIR 17 late breaking session

SLIDE 70

Automatic DJ mix generation

  • Interactive music service scenario
  • Entertainment on AI speakers
  • Simple framework, yet hard to optimize


SLIDE 71

Framework


SLIDE 72

Segmenting Music

Segments we want to play can be either highlights or full songs. Downbeat segmentation is critical.


SLIDE 73

Scoring the next candidate

  • Energy difference between segments
  • Representation vectors
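A hypothetical scoring function combining the two signals; the toy embeddings, the energy values, and the weighting `alpha` are illustrative, not the paper's formulation:

```python
import numpy as np

def transition_score(seg_a, seg_b, alpha=0.5):
    """Score candidate segment b to follow segment a.

    seg = (embedding vector, mean energy). Higher is better: similar
    embeddings (e.g. ArtistNet features) and a small energy jump.
    """
    emb_a, e_a = seg_a
    emb_b, e_b = seg_b
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return cos - alpha * abs(e_a - e_b)

current = (np.array([1.0, 0.0]), 0.8)
candidates = [(np.array([0.9, 0.1]), 0.7),   # similar sound, similar energy
              (np.array([0.0, 1.0]), 0.2)]   # different sound, big energy drop
best = max(range(len(candidates)),
           key=lambda i: transition_score(current, candidates[i]))
print(best)  # the embedding-similar, energy-matched candidate wins
```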


Features extracted from ArtistNet and mapped with t-SNE


SLIDE 74

Conclusion

SLIDE 75

Conclusion

  • Looked at music data from an AI researcher's perspective
  • Baselines for music research
  • Clova Music data analysis for recommendation
  • Publications and future research on music content
  • Very challenging, Still a blue ocean!
SLIDE 76

Future problems for Music and Clova?

  • Onset detection
  • Music segmenting
  • Music generation
  • Source separation
  • Cold start recommendation
  • Balancing familiar vs. exploration
  • Good evaluation metrics
  • Reinforcement Learning for recommendation
SLIDE 77

Clova AI Research

  • Team responsible for Clova-oriented advanced AI research
  • Open, Collaborative, and Self-motivated
  • Outstanding global team (working language: English)
  • Position: Research scientists, PostDoc, AI SW engineers, Internship researchers
  • Research infrastructure and support: NSML
  • Your KPI may include publication
  • Advisory Members: Active research involvement + authorships

조경현(NYU) 임재환(USC) 김성훈(HKUST) 박혜원(MIT) 신진우(KAIST) 주재걸(고려대)

Jun-Yan Zhu(MIT)

Hannaneh Hajishirzi (UW)

SLIDE 78

Thank you.

We are hiring! adrian.kim@navercorp.com