How We Built a Content-Based Filtering Recommender System For Music with Python
Sammy Lee
May 24
Background: This project comes from Lambda Labs at Lambda School, in which students spent the past 5 weeks building production-grade web applications, some of them using machine learning models as part of their backends.
Our group’s assigned task was to build an app that recommends no-copyright music based on users’ moods.
I was part of the data science team that implemented a backend using tools like Python and Django.
If you want to see the finished product, check out moodibeats.com, where you’ll find a catalogue of over 1000 copyright-free songs, some of them labeled by the machine learning model you’re about to see here.
Part I: A Brief Glimpse into Recommender Systems
Whenever I think of recommender systems I think of High Fidelity, a movie that came out 19 years ago about a record store owner named Rob Gordon whose employees are supposedly so knowledgeable about music that they actually stop customers from buying the music they want to buy.
I mention this because before we had Netflix and Amazon and YouTube, real human beings in the flesh were the closest to personalized recommender systems we had.
The record store owner who knows what you like and recommends the newest Blink-182 or Green Day album, the restaurant server who’s tasted everything on the menu and knows exactly what you want based on what you had before, or the random stranger on the street who tells you the fastest and easiest way to get to a place you’re looking for — these are all recommender systems in the flesh — and very effective.
The problem is they don’t scale.
And couldn’t scale until the internet arrived with things like Google.
And even then there was no way to effectively evaluate the recommendation process until the arrival of data science and its ability to deal with lots of data.
Recommender systems in general can be divided into two types:
Collaborative Filtering: Serves recommendations based on User similarity, using kNN (k-Nearest Neighbor) or matrix-factorization algorithms. Collaborative Filtering is the gold standard of personalized recommender systems, but you need lots and lots of User data, which is why apps like YouTube and Amazon are able to do it so effectively.
Content-Based Filtering: Serves recommendations based on the metadata or characteristics of the very thing you are trying to recommend. If you’re recommending things like movies, then you would use genre, actors, directors, length of movie, etc. as inputs to predict whether you’d like a movie or not.
For MoodiBeats we ended up going with Content-Based Filtering due to the limitations of our data.
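To make the content-based idea concrete, here is a minimal sketch that ranks tracks purely by how similar their metadata is to a track the user already liked. The catalogue, the tag names, and the `recommend` helper are all made up for illustration; this is not the MoodiBeats model, just the general shape of the technique.

```python
import math

# Hypothetical catalogue: each track is described only by its own metadata
# (a set of tags), never by other users' behaviour.
CATALOGUE = {
    "track_a": {"edm", "instrumental", "upbeat"},
    "track_b": {"edm", "instrumental", "dark"},
    "track_c": {"acoustic", "vocal", "slow"},
}


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two tag sets treated as binary vectors."""
    if not tags_a or not tags_b:
        return 0.0
    shared = len(tags_a & tags_b)
    return shared / math.sqrt(len(tags_a) * len(tags_b))


def recommend(liked_track, catalogue, top_n=2):
    """Rank every other track by similarity to the one the user liked."""
    liked_tags = catalogue[liked_track]
    scores = [
        (cosine_similarity(liked_tags, tags), name)
        for name, tags in catalogue.items()
        if name != liked_track
    ]
    # Highest similarity first; ties broken by name for stable output.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scores[:top_n]]


if __name__ == "__main__":
    print(recommend("track_a", CATALOGUE))
```

Notice that no user data is needed at all, which is exactly why this approach fits a brand-new app with no listening history.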
Part II: A Glimpse into the beginnings of MoodiBeats
Going back to our Lambda Labs project, there was a decent amount of struggle within our team in the planning stages of MoodiBeats.
One of the major problems with trying to integrate machine learning into a web app that doesn’t exist yet is the chicken-or-the-egg problem: how do you design a data-science-driven frontend without the actual data, and how do you get the data for a website whose specification you aren’t sure about?
Initially the data scientists wanted ready-made CSV files to work with, so we spent almost two weeks analyzing the last.fm dataset and the infamous FMA dataset.
Eventually, wanting to avoid copyright issues and the impracticality of letting Users download songs, the FrontEnd team decided on using YouTube’s API and Player for no-copyright music only.
This forced the data science team to completely scrap all the work done on the last.fm and FMA datasets and refocus on grabbing data from the YouTube v3 API in the middle of the project.
Part III: Let’s Build a barebones Django backend as a REST API
Warning 1: I’m going to rapidly build a Django backend with minimal explanation, so readers who are interested but don’t have much experience with Django can consult the countless tutorials here on Medium or YouTube or here.
Warning 2: Parts III and IV will necessarily be long.
However, know that what we’re doing here is actually building out a machine learning pipeline that will:
Automatically retrieve data from the YouTube API v3
Run a machine learning model
Expose both YouTube data and the machine-generated predictions (in this case moods) as RESTful endpoints
It is therefore non-trivial, and perhaps highly useful for data scientists who want to collect a huge quantity of novel data and have it in a form accessible to the rest of the world.
*If you care only about the data science part go ahead and skip to Part V.
On the command line:

    """
    Working within a virtual environment is highly recommended.
    For this project either Conda or Pipenv is sufficient.
    I'm using Python 3.6 and Django 1.11.20 and PostgreSQL for this project.
    """
    $ mkdir MoodiBeatsAPI && cd MoodiBeatsAPI
    # psycopg2 is for interfacing with PostgreSQL database
    $ pip install Django==1.11.20 psycopg2
    # don't forget the trailing period
    $ django-admin startproject music_selector .

Now open the project folder (MoodiBeatsAPI) in a text editor of your choice. Lots of people use VS Code nowadays; I still use Sublime Text.
Django ships with SQLite3 as a database, but my preference is to always use PostgreSQL, so if you don’t already have PostgreSQL I suggest you install it on your system.
Your project structure should look exactly like this:

    .
    ├── manage.py
    └── music_selector
        ├── __init__.py
        ├── settings.py
        ├── urls.py
        └── wsgi.py

First create your PostgreSQL database:

    $ psql -d postgres
    postgres=# CREATE DATABASE moodibeats;
    # You can verify that the database has been created by running
    postgres=# \l
    # And exit
    postgres=# \q

Go into your settings.py and make some changes:

    ### Change this:
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.sqlite3',
            'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        }
    }

    ### To this:
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'moodibeats',  # Name of our database
            'USER': 'sammylee',  # This would be different for you
            'PASSWORD': '',
            'HOST': 'localhost',
            'PORT': '5432',
        }
    }

On the command line:

    $ python manage.py migrate
    $ python manage.py createsuperuser
    # Create your credentials
    $ python manage.py runserver
    # Go to http://127.0.0.1:8000/ on your web browser
    # Append 'admin' at the end of your localhost to gain access to your
    # Django admin app

If everything went smoothly, you should see the Django admin site.
Now let’s make our Django App, which will be the main functioning core of our project.
All we’re doing here is creating a database to hold our data and exposing it as an endpoint, so views.py and the Django template system won’t be necessary.
On the command line:

    $ python manage.py startapp songs

Then add your new ‘songs’ app to settings.py under INSTALLED_APPS:

    INSTALLED_APPS = [
        # ...
        'songs',
    ]

Your project structure should now include a new songs directory alongside music_selector.
Let’s create a database table for our new songs app in models.py:

    # songs/models.py
    from django.db import models


    class NewVideo(models.Model):
        MOOD_CHOICES = (
            ('HAPPY', 'Happy'),
            ('IN-LOVE', 'In-Love'),
            ('SAD', 'Sad'),
            ('CONFIDENT-SASSY', 'Confident-sassy'),
            ('CHILL', 'Chill'),
            ('ANGRY', 'Angry'),
        )
        video_title = models.TextField(db_index=True, null=True, blank=True)
        video_id = models.CharField(max_length=11, null=False, blank=True, primary_key=True)
        moods = models.CharField(choices=MOOD_CHOICES, max_length=20, default='HAPPY')
        labeled = models.NullBooleanField()
        video_description = models.TextField(null=True, blank=True)
        predicted_moods = models.CharField(max_length=17, null=True, blank=True)

        def __str__(self):
            # video_title is nullable, so fall back to the id
            return self.video_title or self.video_id

The six moods you see above [happy, sad, confident-sassy, in-love, chill, angry] will be the moods that our machine learning model will try to predict and eventually expose as a REST endpoint.
This NewVideo model is also where we’re going to hold the training data for our backend.
And the data will come from a python function that will make a series of calls to the YouTube v3 API and automatically save to our database.
Now in admin.py:

    # songs/admin.py
    from django.contrib import admin

    from .models import NewVideo


    class NewVideoAdmin(admin.ModelAdmin):
        list_display = [
            'video_id',
            'video_title',
            'moods',
            'labeled',
            'predicted_moods',
        ]
        search_fields = [
            'video_id',
            'video_title',
            'moods',
        ]
        list_editable = [
            'moods',
            'labeled',
        ]


    admin.site.register(NewVideo, NewVideoAdmin)

Then on the command line:

    $ python manage.py makemigrations
    $ python manage.py migrate

If you made it this far, pat yourself on the back: we’re halfway to making this RESTful.
For those who are new to RESTful web APIs, the simplest way to think about it is it’s just a way for a backend web application to expose its database as JSON.
It’s considered one of the most important inventions in software engineering, but super easy to grasp and use once you start implementing them yourself.
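If the "database as JSON" idea feels abstract, here is a minimal sketch using only the standard library. The rows and the `to_json_response` / `from_json_response` helpers are hypothetical stand-ins for what a REST framework and a frontend client do for you; the field names mirror the NewVideo model from this post.

```python
import json

# Hypothetical rows, shaped like records in the songs table.
ROWS = [
    {"video_id": "abc123DEF45", "video_title": "Sunny Instrumental", "moods": "HAPPY"},
    {"video_id": "xyz987UVW65", "video_title": "Rainy Night Beat", "moods": "SAD"},
]


def to_json_response(rows, fields):
    """Serialize only the whitelisted fields of each row to a JSON string."""
    payload = [{field: row[field] for field in fields} for row in rows]
    return json.dumps(payload)


def from_json_response(body):
    """What a client (for example a frontend) does with the response body."""
    return json.loads(body)


if __name__ == "__main__":
    body = to_json_response(ROWS, fields=["video_title", "video_id", "moods"])
    print(body)
    print(from_json_response(body)[0]["moods"])
```

The whitelist of fields plays the same role as the `fields` list on a serializer: the backend decides what leaves the database, and the client only ever sees JSON.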
For this we’re going to need the Django REST Framework which we can just overlay on top of our project.
On the command line:

    $ pip install djangorestframework==3.8.2

    # And add to INSTALLED_APPS on settings.py
    INSTALLED_APPS = [
        # ...
        'rest_framework',
    ]

Now we’re actually going to create a separate “api” app that will hold all of our API-related code:

    $ python manage.py startapp api

Inside of api, create a serializers.py file:

    # api/serializers.py
    from rest_framework import serializers

    from songs.models import NewVideo


    class NewVideoSerializer(serializers.ModelSerializer):
        class Meta:
            model = NewVideo
            fields = [
                'video_title',
                'video_id',
                'moods',
            ]

Then in your views.py inside of the api app:

    # api/views.py
    from rest_framework import generics

    from songs.models import NewVideo
    from .serializers import NewVideoSerializer


    class NewVideoAPIView(generics.ListCreateAPIView):
        queryset = NewVideo.objects.all()
        serializer_class = NewVideoSerializer

Now create a urls.py inside of the api app:

    # api/urls.py
    from django.conf.urls import url

    from .views import NewVideoAPIView

    urlpatterns = [
        url(r'^new-videos/$', NewVideoAPIView.as_view()),
    ]

Then in your urls.py inside of the project configuration folder (music_selector):

    # music_selector/urls.py
    from django.conf.urls import url, include
    from django.contrib import admin

    urlpatterns = [
        url(r'^admin/', admin.site.urls),
        url(r'^api/', include('api.urls', namespace='api')),
    ]

Now go to your Django admin and create a NewVideo object by inserting some data. Then run your local server and go to this endpoint on your browser:

    $ python manage.py runserver
    # Point your browser to http://127.0.0.1:8000/api/new-videos/

One of the most awesome traits of the Django REST Framework is its browsable API, which, if everything went okay, should list the NewVideo object you just created.
Congratulations, you’ve created a RESTful endpoint!
The importance of having created this is that we can now expose our data in a way that can be consumed as an API by things like React-made FrontEnds, which is exactly what we did in our MoodiBeats project.
Part IV: Let’s Create A Training Dataset
One of the coolest parts about Django is the ability to create what are called Management Commands: functions that an admin user can run on the command line:

    $ python manage.py do_something

For the data science portion of MoodiBeats we needed a function that would grab data from the YouTube API and populate the database.
To do this we created a management command in Django to grab video_id and video_title, and set up Heroku Scheduler on our Heroku server to run that same function once every 24 hours, effectively creating a Cron Job.
To accomplish this you must go through these steps:
Create a folder inside of the songs app called management
Inside of management create a file called __init__.py
Also inside of management create another folder called commands
Then inside of commands create another file called __init__.py
And finally inside of commands, create a file called get_new_videos.py
Get yourself a YouTube v3 API Key

    $ pip install python-dotenv

Make sure you have python-dotenv installed if you actually want to run this management command.
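If you would rather not click through the folder steps by hand, they can be sketched as a few lines of Python. The `scaffold_management_command` helper is a hypothetical convenience of mine, not something Django ships; it only creates the empty folders and files described in the steps above.

```python
from pathlib import Path


def scaffold_management_command(app_dir, command_name):
    """Create management/commands/<command_name>.py with its __init__.py files."""
    commands_dir = Path(app_dir) / "management" / "commands"
    commands_dir.mkdir(parents=True, exist_ok=True)

    # Both package levels need an (empty) __init__.py.
    (commands_dir.parent / "__init__.py").touch()
    (commands_dir / "__init__.py").touch()

    command_file = commands_dir / f"{command_name}.py"
    command_file.touch()
    return command_file


if __name__ == "__main__":
    # Assumes you run this from the project root, next to manage.py.
    print(scaffold_management_command("songs", "get_new_videos"))
```

Django discovers the command by its file name, so the name you pass here is the name you later type after `python manage.py`.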
Then inside of your top-level directory, create a file called .env where you want to store things like SECRET_KEY or your YouTube API Key, and then add .env to your .gitignore file.
You want to do these things if you ever want to commit your project to a GitHub Repo.
Remember, never ever commit things like API Keys or anything personal to your GitHub Repo or any public place.
You risk very bad things happening from very bad people if you do.
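One habit that pairs well with keeping secrets out of the repo is failing loudly when a secret is missing, instead of sending an empty key to an API. Here is a minimal sketch with the standard library only; `require_env` is a hypothetical helper name, and DEVELOPER_KEY is the variable name the management command in this post reads.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or fail loudly if unset."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


if __name__ == "__main__":
    try:
        key = require_env("DEVELOPER_KEY")
        print("Key loaded, length:", len(key))
    except RuntimeError as exc:
        print(exc)
```

With python-dotenv you would call `load_dotenv()` first so the values in .env land in the environment before this check runs.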
get_new_videos.py:

    # songs/management/commands/get_new_videos.py
    # Credit goes to my data science teammate John Humphreys
    # for writing this function.
    # Requires: pip install google-api-python-client python-dotenv
    import os

    from django.core.management.base import BaseCommand
    from django.db import DatabaseError
    from dotenv import load_dotenv
    from googleapiclient.discovery import build

    from songs.models import NewVideo

    load_dotenv()


    def html_reverse_escape(string):
        '''Reverse escapes HTML code in string into ASCII text.'''
        # see Ned Batchelder post https://stackoverflow.com/questions/2077283/escape-special-html-characters-in-python
        return (string
                .replace("&amp;", "&")
                .replace("&#39;", "'")
                .replace("&quot;", '"'))


    def search_api():
        '''Searches YouTube Data API v3 for videos based on project-specified
        parameters and saves each result to the database.'''
        api_service_name = 'youtube'
        api_version = 'v3'
        DEVELOPER_KEY = os.getenv('DEVELOPER_KEY')

        youtube = build(api_service_name, api_version, developerKey=DEVELOPER_KEY)
        request = youtube.search().list(
            part='id,snippet',
            maxResults=20,
            q='instrumental edm',
            relevanceLanguage='en',
            type='video',
            videoDuration='medium',
            videoLicense='creativeCommon',
            videoSyndicated='true',
        ).execute()

        for search_result in request['items']:
            video_title = html_reverse_escape(search_result['snippet']['title'])
            video_id = search_result['id']['videoId']
            video_description = search_result['snippet']['description']
            try:
                # predicted_moods stays empty until the model fills it in later
                new_video = NewVideo(video_id=video_id,
                                     video_title=video_title,
                                     video_description=video_description)
                new_video.save()
            except DatabaseError:
                # Skip rows the database rejects and keep going
                pass


    class Command(BaseCommand):
        def handle(self, *args, **options):
            print("Pulling data from YouTube API and saving")
            search_api()

If you study the function, it’s set to retrieve a maximum of 20 results from the YouTube API.
We purposely set q to ‘instrumental edm’ and videoLicense to ‘creativeCommon’ because we want only the no-copyright music videos.
Let’s run the command:

    $ python manage.py get_new_videos

Now run your local Django server and go back to your admin.
You should see the newly saved videos listed, and if you click on a video_id to get into the Detail View, you should see the video description in the description field.
Doing this with different query parameters over a period of about 7 days ended up giving us a database of over 1000 songs.
Before we move onto the data science portion of the project what you want to do in order to complete a training set is LABEL your data.
This turned out to be a rude awakening for me as I ended up labeling over 800 YouTube videos with the “correct” moods.
My advice to data science students is this: Don’t get used to the UCI data repository.
Part V: Time to Data Science the Heck Out of Our Data
We’re going to be using data science tools, so make sure you have things like Conda (I suggest miniconda), Jupyter, Pandas, and Scikit-Learn installed in your virtual environment.
Now that we finally have data it’s time to transform them into the neat little CSV files we expected at the beginning.
We’re going to connect to our postgreSQL database, take a peek at the table schema, and run a simple copy command to a folder on our desktop that I’ve created called CSV.

    $ psql -d postgres
    postgres=# \c moodibeats
    moodibeats=# \dt
    moodibeats=# SELECT * FROM songs_newvideo LIMIT 5;
    moodibeats=# \copy songs_newvideo(video_id,video_title,moods,labeled,video_description,predicted_moods) TO '/Users/sammylee/desktop/CSV/videos.csv' DELIMITER ',' CSV HEADER;

Now start up your Jupyter Notebook, import pandas, and have a look at the data:

    # In the same directory which contains your 'videos.csv' file
    $ jupyter notebook

Then:

    import pandas as pd

    videos = pd.read_csv('videos.csv', encoding='utf-8')
    videos.head()

Of course, in my case we had a lot more data, and most of it was labeled, but this is exactly what we want.
Let’s recap what we’ve accomplished so far:
We structured our problem (Content-Based Filtering) and put a plan in place to build a Django backend for data science to be used by a React frontend.
We then built the backend using Django REST Framework.
We utilized the YouTube v3 API to retrieve data.
We’ve effectively created a first-pass mini-pipeline for the data science portion of MoodiBeats.
We also need to remember that we’re using text data for our analysis and that means we need to use a different set of tools for our machine learning.
The most important of these different tools is the Bag-of-Words model which allows us to represent text as vectors of numbers.
The bag-of-words model is basically a two-step process: first tokenizing a document of text, and then transforming the tokens into feature vectors of word counts.
For example:

    """
    This is taken straight out of Sebastian Raschka & Vahid Mirjalili's
    Python Machine Learning 2nd Edition, and in fact this project owes a
    debt of gratitude to their chapter 8 on Sentiment Analysis
    """
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    count = CountVectorizer()
    docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])
    bag = count.fit_transform(docs)

    # print out the vocabulary of our bag-of-words:
    print(count.vocabulary_)

This will return a Python dictionary with unique words and their index positions:

    {'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}

Furthermore, we can see the feature vectors created for us by the bag-of-words model:

    print(bag.toarray())

    # Returns
    [[0 1 1 1 0 1 0]
     [0 1 0 0 1 1 1]
     [1 2 1 1 1 2 1]]

The first column of this array, at index position 0, represents the word ‘and’, which occurs only in the third sentence, hence [0,0,1].
The second column [1,1,2] represents ‘is’, and we can see that ‘is’ occurs once in each of the first two sentences and twice in the last sentence.
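To see that there is no magic in those two steps, here is a rough sketch of the same bag-of-words computation in plain Python. The function names are mine, and the token pattern simply mirrors scikit-learn's default of keeping words with two or more characters; treat it as an illustration rather than a replacement for CountVectorizer.

```python
import re
from collections import Counter

DOCS = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining and the weather is sweet",
]


def tokenize(doc):
    """Step 1: lowercase and split into word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


def bag_of_words(docs):
    """Step 2: build a sorted vocabulary, then count each word per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = {
        word: index
        for index, word in enumerate(sorted({w for doc in tokenized for w in doc}))
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Iterating the dict follows insertion order, which is the sorted order.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vectors = bag_of_words(DOCS)
    print(vocab)
    for row in vectors:
        print(row)
```

Each row has one slot per vocabulary word, so every document becomes a vector of the same length no matter how long the text is.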
The second tool we need is the actual machine learning algorithm.
For MoodiBeats we settled on a Logistic Regression classifier.
[Figure: Sigmoid function for Logistic Regression]
What we’re essentially doing is building a non-personalized content-based filtering recommender system by taking metadata about YouTube music videos (in this case video descriptions in the form of text) and using that to predict the moods of the videos themselves.
Logistic Regression classifies a sample based on a probability produced by an S-shaped function called the sigmoid function.
A threshold of 0.5 is set: probabilities above this threshold are classified as 1, and probabilities below it are classified as 0.
Using this approach on numerical representations of video descriptions allowed us to find the probability of each mood and pick the one with the highest probability as our prediction (the One-vs-Rest method).
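Here is a small sketch of those two ideas, the sigmoid threshold and the One-vs-Rest pick, in plain Python. The raw scores are made-up numbers standing in for what six trained binary classifiers would output; none of this is the actual MoodiBeats model.

```python
import math

MOODS = ["HAPPY", "IN-LOVE", "SAD", "CONFIDENT-SASSY", "CHILL", "ANGRY"]


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify_binary(z, threshold=0.5):
    """Binary case: 1 if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


def predict_mood(scores_by_mood):
    """One-vs-Rest: one binary model per mood; pick the most probable mood."""
    probabilities = {mood: sigmoid(z) for mood, z in scores_by_mood.items()}
    return max(probabilities, key=probabilities.get), probabilities


if __name__ == "__main__":
    # Made-up raw scores (w.x + b), one per mood classifier.
    raw_scores = dict(zip(MOODS, [0.2, -1.5, -0.7, 0.1, 1.3, -2.0]))
    mood, probs = predict_mood(raw_scores)
    print(mood, round(probs[mood], 3))
```

Each mood gets its own "this mood vs. everything else" probability, and the final prediction is simply whichever mood is most confident.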
Another important consideration is that almost always, when dealing with text data, you’ll be dealing with dirty data: HTML, emojis, punctuation, etc.
We will write a function to preprocess dirty data as well.
So here we’re combining text cleaning and text tokenizing to give us data that we can then run a Logistic Regression classifier on.

    """A simple text cleaning function borrowed from Sebastian Raschka"""
    import re

    from nltk.corpus import stopwords

    # Requires a one-time nltk.download('stopwords')
    stop = stopwords.words('english')


    def tokenizer(text):
        # Strip HTML tags
        text = re.sub(r'<[^>]*>', '', str(text))
        # Pull out emoticons such as :) ;-( =) so they survive the cleanup
        emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
        # Replace non-word characters with spaces, then re-append emoticons
        text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
        tokenized = [w for w in text.split() if w not in stop]
        return tokenized

The next step before actually training is to encode our target variable, meaning transform our moods [HAPPY, CONFIDENT-SASSY, SAD, ANGRY, CHILL, IN-LOVE] into numbers that our algorithm can use.
Here, we’re assuming that you’ve gone through your data in the Django admin, watched all the YouTube music videos, and correctly labeled them before sending it over to a Pandas DataFrame to be analyzed as training data.

    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    videos['moods_enc'] = le.fit_transform(videos['moods'])
    videos.head()

Here is the core of our training in final form:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Assumed split: the features are the video descriptions and the target is
    # the encoded mood; the test size and random_state are illustrative.
    X_train, X_test, y_train, y_test = train_test_split(
        videos['video_description'], videos['moods_enc'],
        test_size=0.2, random_state=0)

    vect = HashingVectorizer(decode_error='ignore',
                             n_features=2**21,
                             preprocessor=None,
                             tokenizer=tokenizer)

    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
    lr = LogisticRegression()
    grid_search = GridSearchCV(lr, param_grid, cv=5)

    X_train = vect.transform(X_train.tolist())
    grid_search.fit(X_train, y_train)

    X_test = vect.transform(X_test.tolist())
    print('Test set accuracy: {:.3f}'.format(grid_search.score(X_test, y_test)))
    print("Best parameters: {}".format(grid_search.best_params_))
    print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Training this model on our larger YouTube API dataset gave us:

    Test set accuracy: 0.368
    Best parameters: {'C': 10}
    Best cross-validation score: 0.42

While these results aren’t great, we basically ran out of time to collect more data and re-train our model, as our projects were put into a state of feature freeze.
And in fact, our presentation to the entire school is only an hour away as I write these words.
The next challenge involved putting the model into production.
What we did was automate this process by inserting our trained model inside of the Django app to be run as a Management Command:

    $ python manage.py get_new_videos

You can set this function as a Cron Job on Heroku Scheduler in production to automatically run your predictions alongside the call to the YouTube API.
I’m going to skip this part for the purpose of brevity, but let me know in the comments if you’d like to learn about it.
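In the meantime, here is a very rough sketch of what the prediction step inside such a command could look like. Everything here is hypothetical: plain dicts stand in for NewVideo rows, and `keyword_stub` stands in for the trained classifier, so this shows only the shape of the loop and not our production code.

```python
def fill_predicted_moods(videos, predict_mood):
    """Attach a predicted mood to every video that does not have one yet.

    `videos` is a list of dicts with 'video_description' and 'predicted_moods'
    keys; `predict_mood` is any callable mapping a description to a mood label.
    Both are hypothetical stand-ins for the Django model and trained classifier.
    """
    updated = 0
    for video in videos:
        if video.get("predicted_moods"):
            continue  # keep predictions that were already stored
        video["predicted_moods"] = predict_mood(video.get("video_description") or "")
        updated += 1
    return updated


def keyword_stub(description):
    """Toy predictor used only to make this sketch runnable."""
    return "CHILL" if "chill" in description.lower() else "HAPPY"


if __name__ == "__main__":
    rows = [
        {"video_description": "Chill lo-fi beats", "predicted_moods": None},
        {"video_description": "Epic drop", "predicted_moods": "ANGRY"},
    ]
    print(fill_predicted_moods(rows, keyword_stub), rows)
```

Skipping rows that already have a prediction keeps a daily scheduled run cheap, since only newly fetched videos need to go through the model.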
Part VI: Lessons Learned
Check this song out:
[Embedded video: an awesome band I discovered as a freshman in college]
If I asked you to categorize this as one of our six moods:
[HAPPY, CHILL, CONFIDENT-SASSY, SAD, ANGRY, IN-LOVE]
I’m pretty sure most of you would answer that this song clearly expresses anger to the point of reaching a state of rage.
But what if I gave you something like this:
What’s going on here?
A song at a general-audience level has two dimensions (song, lyrics), but a music video has three (song, lyrics, visual).
On first inspection the song sounds pretty CHILL.
And the visuals are kind of goofy (check out all the comments about small budgets and green screens).
Now check out some of the song’s lyrics:[Verse 1: Ed Sheeran]I’m at a party I don’t wanna be atAnd I don’t ever wear a suit and tie, yeahWonderin’ if I could sneak out the backNobody’s even lookin’ me in my eyesCan you take my hand?Finish my drink, say, “Shall we dance?” (Hell, yeah)You know I love ya, did I ever tell ya?You make it better like that[Pre-Chorus: Ed Sheeran]Don’t think I fit in at this partyEveryone’s got so much to say (Yeah)I always feel like I’m nobody, mmmWho wants to fit in anyway?[Chorus: Ed Sheeran]‘Cause I don’t care when I’m with my baby, yeahAll the bad things disappearAnd you’re making me feel like maybe I am somebodyI can deal with the bad nightsWhen I’m with my baby, yeahOoh, ooh, ooh, ooh, ooh, ooh‘Cause I don’t care as long as you just hold me nearYou can take me anywhereAnd you’re making me feel like I’m loved by somebodyI can deal with the bad nightsWhen I’m with my baby, yeahOoh, ooh, ooh, ooh, ooh, ooh[Verse 2: Justin Bieber]We at a party we don’t wanna be atTryna talk, but we can’t hear ourselvesPress your lips, I’d rather kiss ’em right backWith all these people all aroundI’m crippled with anxietyBut I’m told it’s where we’re s’posed to beYou know what?.It’s kinda crazy ’cause I really don’t mindAnd you make it better like that[Pre-Chorus: Justin Bieber]Don’t think we fit in at this partyEveryone’s got so much to say, oh yeah, yeahWhen we walked in, I said I’m sorry, mmmBut now I think that we should stayIs this song really CHILL?.SAD?.or even IN-LOVE?How do you get a machine to learn all the complexities of an emotion like In-Love?.Love is the Neapolitan ice-cream of human emotions.
There’s the happy kind of love where everything’s going well and two people can’t stand to be away from each other.
There’s the unrequited kind of love where the emotions are felt by just one person.
There’s even a kind of love reserved for long-time married couples where things aren’t crazy, but kind of just settled.
Moods are super-subjective, and therefore a super-difficult problem for machines to solve.
But super-interesting nevertheless.
Part VII: A Note of Thanks
I’d like to thank my teammates on project MoodiBeats for Lambda School: Jonathan Bernal, John Humphreys, Xander Jake de los Santos, Md Kawsar Hussen, Logan Hufstetler, Davina Taylor, and our project manager Kevin Brack.
Building MoodiBeats was one of the best and most fun experiences I’ve had at Lambda School.
Our team was amazing and I won’t forget this experience.
Code for this Medium post lives here: Captmoonshot/moodibeats_api_medium on GitHub (github.com).