 
Music transcription via convex optimization. Song Mei, ICME, Stanford. June 3, 2015.
The problem. [Figure: record the music; signal on the computer; music scores.]
The problem. Given a music signal played by various instruments, we would like to separate the melodies of the different instruments and write down the music scores. Why is the problem important? Music scores are easier to read: they help people learn to play instruments. Music scores are easier to compare: when you hear a song but do not know its name, how do you search for it? Music scores are easier to revise: musicians and music technicians often want to revise the scores. Music and mathematics have intrinsic connections!
Background knowledge of music. A piece of music is composed of musical notes. The common elements of music are pitch, timbre, loudness, rhythm, speed, etc. Pitch, timbre, and loudness are the most important elements.
Pitch. A piano keyboard has 88 keys: 52 white and 36 black. They are labeled from A0 to C8 (A0, A♯0, B0, C1, ..., B1, C2, ..., B7, C8). Each key corresponds to one frequency. On a standard instrument, the map between pitches and frequencies is given by the general formula $\text{pitch} = 69 + 12 \log_2(\text{frequency} / 440)$. A4 has frequency 440 Hz. A5 has frequency 880 Hz, twice that of A4. Between A4 and A5 there are 11 keys. The keys are equally spaced on a logarithmic scale, with ratio $2^{1/12}$ between adjacent keys. For example, $2^{7/12} = 1.4983\ldots \approx 1.5$. The frequencies of different keys share quasi-rational ratios. These rational relations bring harmony to music.
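The pitch formula above can be tried out numerically. The following is a minimal sketch in Python; the function names `frequency_to_pitch` and `pitch_to_frequency` are illustrative and do not come from the talk.

```python
import math

A4_HZ = 440.0   # reference frequency of A4, as stated on the slide
A4_PITCH = 69   # pitch number assigned to A4 by the slide's formula


def frequency_to_pitch(frequency_hz):
    """pitch = 69 + 12 * log2(frequency / 440)."""
    return A4_PITCH + 12 * math.log2(frequency_hz / A4_HZ)


def pitch_to_frequency(pitch):
    """Inverse map: each step of one key multiplies the frequency by 2**(1/12)."""
    return A4_HZ * 2 ** ((pitch - A4_PITCH) / 12)


if __name__ == "__main__":
    # A5 is 12 keys above A4, so its frequency doubles.
    print(frequency_to_pitch(880.0))
    # Seven keys apart: the frequency ratio 2**(7/12) is close to 3/2.
    print(pitch_to_frequency(76) / pitch_to_frequency(69))
```

Seven keys above A4 the ratio is close to, but not exactly, 3/2, which is what the slide means by quasi-rational ratios.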
Timbre. The waveshapes of real instruments are not sinusoidal. [Figures: waveshapes of a piano note and a guitar note.] The integer multiples of the fundamental frequency are called the harmonic frequencies. Misconception: people believe the timbre of an instrument is determined by its waveshape, which is not true. If we represent a note as $s(t) = \sum_{k=1}^{K} A_k \cos(k \omega_0 t + \psi_k)$, the timbre of the instrument is determined by the relative amplitudes $A_k$ but not by the phases $\psi_k$. Human ears are not sensitive to the phases.
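To illustrate the point about waveshape versus timbre, the sketch below builds two notes with the same harmonic amplitudes $A_k$ but different phases $\psi_k$. The amplitudes, phases, sample count, and helper names are arbitrary choices for illustration, not values from the talk. The waveforms differ, while the magnitudes measured at the harmonic frequencies are identical.

```python
import cmath
import math


def note(amplitudes, phases, n_samples=256, cycles=4):
    """Sample s(t) = sum_k A_k cos(k*w0*t + psi_k) over `cycles` fundamental periods."""
    samples = []
    for n in range(n_samples):
        theta = 2 * math.pi * cycles * n / n_samples  # this is w0 * t
        samples.append(sum(a * math.cos(k * theta + p)
                           for k, (a, p) in enumerate(zip(amplitudes, phases), start=1)))
    return samples


def harmonic_magnitudes(samples, n_harmonics, cycles=4):
    """DFT magnitude at the bins of harmonics 1..n_harmonics, rescaled to amplitudes."""
    n = len(samples)
    magnitudes = []
    for k in range(1, n_harmonics + 1):
        bin_index = k * cycles  # harmonic k sits exactly on this DFT bin
        coeff = sum(x * cmath.exp(-2j * math.pi * bin_index * i / n)
                    for i, x in enumerate(samples))
        magnitudes.append(2 * abs(coeff) / n)
    return magnitudes


if __name__ == "__main__":
    amps = [1.0, 0.5, 0.25]
    wave_a = note(amps, [0.0, 0.0, 0.0])
    wave_b = note(amps, [0.3, 2.0, 4.0])
    print(max(abs(x - y) for x, y in zip(wave_a, wave_b)))  # waveforms differ
    print(harmonic_magnitudes(wave_a, 3))
    print(harmonic_magnitudes(wave_b, 3))                   # same magnitudes
```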
Other statistics. A piece of music lasts about 60 seconds to 30 minutes. In one second there are usually 1 to 30 notes. The frequency range is 30 Hz to 4000 Hz, with any two different pitches at least 10 Hz apart. For the WAV file of the music, the typical sample rate is 44100 Hz.
Let us look at our problem again. Problem: given a music signal played by various instruments, we would like to separate the melodies of the different instruments and write down the music scores. A naive method: record every key of every instrument. In the time domain, these recordings form a finite set of basis signals, and we use matching pursuit to solve the problem. In practice, this does not work! Reason: although the instruments and keys are finite, many other factors matter. The strength with which a key is pressed, the duration, and the room geometry all affect the signal. We cannot use a basis in the time domain!
Previous methods. Spectrogram. For all previous methods, the first step is to form a spectrogram. Spectrogram: STFT, WLT, WPT, CQT, synchrosqueezing, etc. Early research aimed to improve the resolution of the spectrogram. At present, the resolution of the spectrogram is not a big problem (synchrosqueezing, robust spectrotemporal decomposition).
Previous methods. NMF. Many heuristic methods have been proposed to extract features from the spectrogram. Smaragdis and Brown (2003) proposed performing non-negative matrix factorization on the absolute value of the spectrogram: [W, H] = nmf(abs(S), k). Here $S \in \mathbb{C}^{n \times m}$ is the spectrogram, $W \in \mathbb{R}^{n \times k}$, and $H \in \mathbb{R}^{k \times m}$. Each column of W is a basis vector for the relative amplitudes of a note. Each row of H is the time envelope of the corresponding note. The method performs well on some simple examples. It is now often used as an unsupervised learning step to learn the relative amplitudes of the different frequencies of a note.
Previous methods. NMF. [Figure: NMF example with panels abs(S), W(:, 1), and H(1, :).]
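The factorization abs(S) ≈ W H can be sketched in a few lines. The talk does not say which NMF algorithm was used; the sketch below assumes the standard multiplicative-update rule for the Frobenius-norm objective, written with plain Python lists so that it is self-contained. The helper names, the iteration count, and the toy matrix are illustrative only.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for row_v, row_wh in zip(V, WH)
               for v, wh in zip(row_v, row_wh)) ** 0.5


def nmf(V, k, iters=200, seed=0, eps=1e-12):
    """Factor a nonnegative n x m matrix V into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H


if __name__ == "__main__":
    # Toy "spectrogram": two notes with different harmonic profiles and envelopes.
    w1, w2 = [1.0, 0.5, 0.0, 0.25], [0.0, 1.0, 0.5, 0.0]
    h1, h2 = [1.0, 0.8, 0.0, 0.0, 0.3], [0.0, 0.0, 1.0, 0.6, 0.3]
    V = [[a * c + b * d for c, d in zip(h1, h2)] for a, b in zip(w1, w2)]
    W, H = nmf(V, k=2)
    print(frobenius_error(V, W, H))
```

In this reading, the columns of W play the role of per-note harmonic profiles and the rows of H the role of time envelopes, as described on the slide.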
Previous methods. Linear model. Lasso. Given the learned basis, Lee, Yang, and Chen (2011) proposed using matching pursuit for pitch selection and amplitude estimation. After estimation, they used a maximum a posteriori (MAP) algorithm to smooth the result. This state-of-the-art result has a 70% accuracy rate. Why does it fail on the other 30%? Observation: nearly all algorithms do well when notes are well separated in the time-frequency domain. Difficulty arises when notes overlap in the time-frequency domain, e.g. C4 and C5, C4 and G4, or C4 on piano and C4 on violin. The linear model often fails to discover a pitch. The assumption that the absolute values of the spectrograms superpose linearly is wrong!
The assumption of the linear model is wrong. Let $s_1$ and $s_2$ be two notes, with
$$\hat{s}_1(\omega) = \sum_{k=1}^{K} A_k e^{i\psi_k}\, \delta(\omega - 2\pi k f_1),$$
$$\hat{s}_2(\omega) = \sum_{k=1}^{K} B_k e^{i\varphi_k}\, \delta(\omega - 2\pi k f_2),$$
$$\hat{s}(\omega) = a \cdot \hat{s}_1 + b \cdot \hat{s}_2.$$
When $f_1 / f_2$ is not rational, $|\hat{s}(\omega)| = a|\hat{s}_1| + b|\hat{s}_2|$.
When $f_1 / f_2$ is rational, $|\hat{s}(\omega)| \neq a|\hat{s}_1| + b|\hat{s}_2|$.
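A small numeric check makes the failure concrete. The sketch below represents each note by its complex harmonic coefficients and compares $|\hat{s}_1 + \hat{s}_2|$ with $|\hat{s}_1| + |\hat{s}_2|$ at every frequency, taking $a = b = 1$. The fundamentals (220 Hz, 440 Hz, 233 Hz), amplitudes, and phases are made-up illustration values, and 233 Hz stands in for the non-rational case only in the sense that it shares no harmonic with 220 Hz among the few harmonics used.

```python
import cmath
import math


def spectrum(f0, amplitudes, phases):
    """Map each harmonic frequency k*f0 to its complex coefficient A_k * e^{i psi_k}."""
    return {k * f0: a * cmath.exp(1j * p)
            for k, (a, p) in enumerate(zip(amplitudes, phases), start=1)}


def compare_superposition(spec1, spec2):
    """For every frequency, return the pair (|s1 + s2|, |s1| + |s2|)."""
    result = {}
    for f in sorted(set(spec1) | set(spec2)):
        c1, c2 = spec1.get(f, 0j), spec2.get(f, 0j)
        result[f] = (abs(c1 + c2), abs(c1) + abs(c2))
    return result


if __name__ == "__main__":
    low = spectrum(220, [1.0, 0.5, 0.25, 0.125], [0.0] * 4)
    octave = spectrum(440, [0.8, 0.4], [math.pi, math.pi])  # shares 440 and 880 Hz
    detuned = spectrum(233, [0.8, 0.4], [math.pi, math.pi])  # no shared harmonics
    for f, (true_mag, linear_mag) in compare_superposition(low, octave).items():
        print(f, true_mag, linear_mag)
    for f, (true_mag, linear_mag) in compare_superposition(low, detuned).items():
        print(f, true_mag, linear_mag)
```

At the shared harmonics the two coefficients partially cancel, so the observed magnitude falls well below what the linear model on absolute values predicts; with no shared harmonics the two quantities agree.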
Our work. Intuitions. Music intuition: when we learn the basis of every note, only the relative amplitudes are reliable; the relative phase information is not. Math intuition: the superposition does not hold for the absolute value of the spectrogram. We need to consider the phase! Conclusion: though the phase information cannot be learned in advance, we need to include the phase in the model!
Model formulation. Stationary case. We observe a stationary oscillatory signal with noise,
$$s(t) = \sum_{j=1}^{J} \alpha_j s_j(t) + \varepsilon(t) = \sum_{j=1}^{J} \alpha_j \sum_{k=1}^{K_j} A_{jk} \cos(k \omega_j t + \psi_{jk}) + \varepsilon(t),$$
where each $\alpha_j \in \mathbb{R}_+$ and $\psi_{jk} \in [0, 2\pi)$ are unknown, and the $A_{jk} \in \mathbb{R}_+$ are known. The problem is to estimate each $\alpha_j$ and $\psi_{jk}$. The interesting cases are those in which the $\omega_j$ for different $j$ share rational ratios. To simplify notation, we assume $\omega_j = \omega_0$ for $j = 1, 2, \ldots, J$ (or we use their gcd).
Model formulation. Stationary case. The Fourier transform of the (analytic) signal $s(t)$ gives
$$\hat{s}(\omega) = \sum_{k=1}^{K} \Big( \sum_{j=1}^{J} \alpha_j A_{jk} e^{i\psi_{jk}} \Big) \delta(\omega - k\omega_0) + \hat{\varepsilon}(\omega),$$
taking $K = \max_j K_j$ and $A_{jk} = 0$ for $k > K_j$. We can see that $\hat{s}(k\omega_0) \sim \sum_{j=1}^{J} \alpha_j A_{jk} e^{i\psi_{jk}}$.
Model formulation. Stationary case. The parameter we need to estimate is $\alpha_j$. We propose to solve the following optimization problem:
$$\min_{\alpha_j, \psi_{jk}} \; \sum_{k=1}^{K} \Big( |\hat{s}(k\omega_0)| - \Big| \sum_{j=1}^{J} \alpha_j A_{jk} e^{i\psi_{jk}} \Big| \Big)^2 + \lambda \|\alpha\|_1 \qquad (1)$$
subject to $\alpha_j \geq 0$, $j = 1, 2, \ldots, J$.
The minimizer of $\sum_{k=1}^{K} \big( |\hat{s}(k\omega_0)| - \big| \sum_{j=1}^{J} \alpha_j A_{jk} e^{i\psi_{jk}} \big| \big)^2$ may not be unique, so we add an $\ell_1$ penalty on $\alpha$ to find the sparsest solution. The relation to the linear model is that, in the linear model, the $\psi_{jk}$ are constrained to be 0. How do we solve this problem?
Non-convex problem to convex problem. Theorem. The optimization problem (1) is equivalent to the convex optimization problem
$$\min_{C_k, \alpha_j} \; \sum_{k=1}^{K} \big( |\hat{s}(k\omega_0)| - C_k \big)^2 + \lambda \|\alpha\|_1 \qquad (2)$$
subject to, for each $k$,
$$\alpha_j \geq 0, \quad j = 1, 2, \ldots, J,$$
$$C_k \leq \sum_{j=1}^{J} \alpha_j A_{jk},$$
$$\max_{l} \Big( \alpha_l A_{lk} - \sum_{j \neq l} \alpha_j A_{jk} \Big) \leq C_k.$$
The minima of problems (1) and (2) are equal, and the minimizers $\hat{\alpha}_j$ are the same. This problem is extremely easy to solve using cvx.
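The two constraints on $C_k$ can be read as the range of magnitudes that a sum of phasors with fixed lengths $r_j = \alpha_j A_{jk}$ and free phases can reach: at most the sum of the lengths, and at least the largest length minus the sum of the others (or zero when that is negative). This reading is an interpretation, not the proof from the talk. The sketch below checks the range numerically by sampling random phases; the lengths and helper names are illustrative.

```python
import cmath
import math
import random


def magnitude_bounds(lengths):
    """Range of |sum_j r_j e^{i psi_j}| over all phases, for fixed lengths r_j >= 0."""
    total = sum(lengths)
    largest = max(lengths)
    lower = max(0.0, 2 * largest - total)  # largest minus the sum of the others
    return lower, total


def phasor_sum_magnitude(lengths, phases):
    return abs(sum(r * cmath.exp(1j * p) for r, p in zip(lengths, phases)))


def sample_magnitudes(lengths, n_samples=2000, seed=0):
    """Magnitudes obtained from uniformly random phases."""
    rng = random.Random(seed)
    return [phasor_sum_magnitude(lengths,
                                 [rng.uniform(0, 2 * math.pi) for _ in lengths])
            for _ in range(n_samples)]


if __name__ == "__main__":
    lengths = [3.0, 1.0, 0.5]
    lower, upper = magnitude_bounds(lengths)
    samples = sample_magnitudes(lengths)
    print(lower, min(samples), max(samples), upper)
    # Both ends are attained: aligned phases, and the largest opposed to the rest.
    print(phasor_sum_magnitude(lengths, [0.0, 0.0, 0.0]))
    print(phasor_sum_magnitude(lengths, [0.0, math.pi, math.pi]))
```

Because every value in this interval is reachable by some choice of phases, the phases can be eliminated and replaced by the single variable $C_k$ with linear bounds, which is what makes the reformulated problem convex.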