Audio AI: isolating vocals from stereo music using Convolutional Neural Networks

Oh Lord… that was something (you might be asking yourself at this point).

I’m gonna address this at the end of the article so that we don’t switch contexts yet!

If our model learns well, during inference all we need to do is implement a simple sliding window over the STFT of the mix.

After each prediction, we move our window to the right by 1 time-frame, predict the next vocal frame and concatenate it with the previous prediction.
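That sliding-window loop can be sketched in a few lines of numpy. The model below is just a stand-in function, and the 25-frame window width is an assumption (the article doesn't state the exact context size at this point):

```python
import numpy as np

def predict_vocal_frame(stft_window):
    # Stand-in for the trained model: maps a (freq_bins, context) window
    # of the mix STFT to a single predicted vocal frame (freq_bins,).
    return stft_window[:, -1] * 0.5  # placeholder prediction

def separate(mix_mag, context=25):
    """Slide a window over the mix magnitude STFT with a stride of one
    time frame, predicting one vocal frame per step and concatenating."""
    freq_bins, n_frames = mix_mag.shape
    vocal_frames = []
    for t in range(n_frames - context + 1):
        window = mix_mag[:, t:t + context]
        vocal_frames.append(predict_vocal_frame(window))
    return np.stack(vocal_frames, axis=1)

mix_mag = np.abs(np.random.randn(513, 100))  # mock mix magnitude STFT
vocals = separate(mix_mag)                   # (513, 76) predicted frames
```

The stride of one frame means each inference pass contributes exactly one new column to the output spectrogram.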

As for the model, we can start with the same model we used for VAD as a baseline; after a few changes (the output shape is now (513, 1), a linear activation at the output, MSE as the loss function), we can begin our training.
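The two key changes (a 513-dimensional linear output and an MSE objective) can be illustrated with a minimal numpy sketch; the 128-dimensional feature vector and the weight shapes are purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense output head: 513 units, linear activation
W = rng.standard_normal((513, 128)) * 0.01
b = np.zeros(513)

def output_head(features):
    # Linear activation: no squashing, so the network can emit
    # arbitrary magnitude values per frequency bin.
    return W @ features + b

def mse(pred, target):
    # Mean squared error over the 513 frequency bins
    return np.mean((pred - target) ** 2)

features = rng.standard_normal(128)   # mock features from the conv stack
pred = output_head(features)          # shape (513,)
target = np.abs(rng.standard_normal(513))  # mock vocal magnitude frame
loss = mse(pred, target)
```

The switch from a sigmoid/binary-cross-entropy head (VAD) to a linear/MSE head is what turns the classifier into a regressor.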

Don’t claim victory yet…

Although the above input/output representation makes sense, after training our vocal separation model several times with varying parameters and data normalizations, the results are not there yet.

It seems like we are asking for too much…

We went from a binary classifier to trying to do regression on a 513-dimensional vector.

Although the network learns the task to a degree, after reconstructing the vocal’s time-domain signal there are obvious artifacts and interference from other sources.

Even after adding more layers and increasing the number of model parameters, the results don’t change much.

So the question then became: can we trick the network into thinking it is solving a simpler problem and still achieve the desired results?

What if, instead of trying to estimate the vocal’s magnitude STFT, we trained the network to learn a binary mask that, when applied to the STFT of the mix, gives us a simplified but perceptually-acceptable-upon-reconstruction estimate of the vocal’s magnitude spectrogram?

By experimenting with different heuristics, we came up with a very simple (and definitely unorthodox from a Signal Processing perspective…) way to extract vocals from mixes using binary masks.

Without going too much into the details, we are going to think of the output as a binary image where a value of ‘1’ indicates a predominant presence of vocal content at a given frequency and time frame, and a value of ‘0’ indicates a predominant presence of music at the given location.
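The exact heuristic isn't spelled out here, but one plausible binarization rule can be sketched as follows: label a time-frequency bin ‘1’ when the vocal's energy dominates the mix at that bin. Both the rule and the 0.5 threshold are assumptions for illustration:

```python
import numpy as np

def perceptual_binarize(vocal_mag, mix_mag, threshold=0.5, eps=1e-8):
    """Hypothetical mask heuristic: a TF bin is labeled 1 (vocal) when
    vocal energy accounts for more than `threshold` of the mix energy
    at that bin, and 0 (music) otherwise."""
    ratio = vocal_mag / (mix_mag + eps)
    return (ratio > threshold).astype(np.float32)

# Mock aligned magnitude spectrograms: 513 bins x 40 frames
vocal = np.abs(np.random.randn(513, 40))
music = np.abs(np.random.randn(513, 40))
mask = perceptual_binarize(vocal, vocal + music)  # binary target image
```

The point is that the training target collapses from a continuous 513-dimensional magnitude vector to a 513-dimensional vector of zeros and ones.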

We may call this perceptual binarization, just to come up with some name.

Visually, it looks pretty unattractive to be honest, but upon reconstructing the time domain signal, the results are surprisingly good.

Our problem now becomes some sort of regression-classification hybrid (take this with a grain of salt…).

We are asking the model to “classify pixels” at the output as vocal or non-vocal, although conceptually (and also in terms of the loss function used, MSE), the task is still a regression one.

Although the distinction might not seem relevant to some, it actually makes a big difference in the model’s ability to learn the assigned task, the second formulation being far simpler and more constrained.

At the same time, it allows us to keep our model relatively small in terms of number of parameters given the complexity of the task, something highly desirable for real-time operation, which was a design requirement in this case.

After some minor tweaks, the final model looks like this.

How do we reconstruct the time-domain signal?

Basically, as described in the naive method section.

In this case, for every inference pass that we do, we are predicting a single timeframe of the vocals’ binary mask.

Again, by implementing a simple sliding window with a stride of one timeframe, we keep estimating and concatenating consecutive timeframes, which end up making up the whole vocal binary mask.
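Putting the mask and the reconstruction together, here is a hedged sketch using SciPy's STFT/ISTFT (1024-point frames giving the 513 bins mentioned above; the all-ones mask simply stands in for a real predicted mask, and the parameters are illustrative):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 22050
mix = np.random.randn(fs)  # 1 second of mock mixed audio

# 1024-point frames -> 1024/2 + 1 = 513 frequency bins
f, t, Z = stft(mix, fs=fs, nperseg=1024)

# Stand-in for the predicted vocal binary mask (all ones here)
mask = np.ones(Z.shape)

# Apply the mask to the complex mix STFT (keeping the mix's phase),
# then invert back to the time domain
_, vocal = istft(Z * mask, fs=fs, nperseg=1024)
```

Reusing the mix's phase is what makes this cheap: only the magnitude is gated by the mask, so no phase estimation is needed.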

Creating the training set

As you know, one of the biggest pain points in supervised Machine Learning (leaving aside all those toy examples with readily available datasets) is having the right data (in amount and quality) for the particular problem you’re trying to solve.

Based on the input/output representations described, in order to train our model, we first needed a significant number of mixes and their corresponding, perfectly aligned and normalized vocal tracks.

There’s more than one way to build this dataset and here we used a combination of strategies, ranging from manually creating mix <> vocal pairs with some acapellas found online, to finding RockBand stems, to web-scraping Youtube.

Just to give you an idea of part of this definitely time-consuming and painful process, our “dataset project” involved creating a tool to automatically build mix <> vocal pairs as illustrated below:

We knew we needed a good amount of data for the network to learn the transfer function needed to map mixes into vocals.

Our final dataset consisted of around 15M examples of ~300-millisecond fragments of mixes and their corresponding vocal binary masks.
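Slicing aligned pairs into those ~300-millisecond examples can be sketched like this; the 25-frame window and the choice of targeting the mask frame at the window's last position are assumptions, not the article's stated layout:

```python
import numpy as np

def make_fragments(mix_mag, vocal_mask, frames_per_example=25):
    """Cut an aligned (mix magnitude STFT, vocal binary mask) pair into
    short training examples: each input is a window of consecutive mix
    frames, each target the mask frame at the window's last position."""
    X, y = [], []
    for t in range(mix_mag.shape[1] - frames_per_example + 1):
        X.append(mix_mag[:, t:t + frames_per_example])
        y.append(vocal_mask[:, t + frames_per_example - 1])
    return np.array(X), np.array(y)

# Mock aligned pair: 513 bins x 200 frames
mix = np.abs(np.random.randn(513, 200))
mask = (np.random.rand(513, 200) > 0.5).astype(np.float32)
X, y = make_fragments(mix, mask)  # overlapping examples from one song
```

Because consecutive windows overlap, even a modest song collection expands into millions of examples, which is how a dataset reaches the ~15M scale mentioned above.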

Pipeline architecture

As you probably know, building a Machine Learning model for a given task is only part of the deal.

In the real world, we need to think about software architecture, especially when we’re dealing with real-time.

For this particular problem, the reconstruction into the time domain can be done all at once after predicting the full vocal binary mask (offline mode) or, more interestingly, as part of a multithreaded pipeline in which we acquire, process, reconstruct and play back in small segments. This makes the system streaming-friendly, and even capable of delivering real-time deconstruction on music that’s being recorded on the fly, with minimal latency.
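A minimal sketch of that acquire → process → playback pipeline, using Python's standard-library threads and bounded queues (all names, segment sizes, and the sentinel-based shutdown are illustrative, not the article's actual implementation):

```python
import queue
import threading
import numpy as np

in_q = queue.Queue(maxsize=8)   # acquired audio segments
out_q = queue.Queue(maxsize=8)  # reconstructed vocal segments
SENTINEL = None                 # end-of-stream marker

def acquire(n_segments=5, seg_len=4096):
    # Stand-in for capturing audio from a stream or microphone
    for _ in range(n_segments):
        in_q.put(np.random.randn(seg_len))
    in_q.put(SENTINEL)

def process():
    # Stand-in for mask prediction + ISTFT on each segment
    while True:
        seg = in_q.get()
        if seg is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(seg * 0.5)

def playback(results):
    # Stand-in for writing segments to the audio device
    while True:
        seg = out_q.get()
        if seg is SENTINEL:
            break
        results.append(seg)

results = []
threads = [threading.Thread(target=acquire),
           threading.Thread(target=process),
           threading.Thread(target=playback, args=(results,))]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

The bounded queues give back-pressure between stages, which is what keeps latency small and memory flat when the input is an endless stream.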

Given this is a whole topic on its own, I’m going to leave it for another article focused on real-time ML pipelines…

I think I’ve covered enough, so why don’t we listen to a couple more examples!?

Daft Punk — Get Lucky (Studio)

We can hear some minimal interference from the drums here…

Adele — Set Fire to the Rain (live recording!)

Notice how at the very beginning our model extracts the crowd’s screaming as vocal content :).

In this case we have some additional interference from other sources.

This being a live recording, it kinda makes sense for this extracted vocal not to be as high-quality as the previous ones.

Ok, so there’s ‘one last thing’…

Given this works for vocals, why not apply it to other instruments…?

This article is extensive enough already, but since you’ve made it this far, I thought you deserved to see one last demo.

With the exact same reasoning for extracting vocal content, we can try to split a stereo track into STEMs (drums, bassline, vocals, others) by making some modifications to our model and of course, by having the appropriate training set :).

If you are interested in the technical details for this extension, just leave me some comments.

I will consider writing a ‘part 2’ for the STEM deconstruction case when time allows!

Thanks for reading, and don’t hesitate to leave questions.

I will keep writing articles on Audio AI, so stay tuned!

As a final remark, as you can see, the actual CNN model we ended up building is not that special.

The success of this work has been driven by focusing on the Feature Engineering aspect and by implementing a lean process for hypothesis validation, something I’ll be writing about in the near future!

PS: shoutouts to Naveen Rajashekharappa and Karthiek Reddy Bokka for their contributions to this work!

