(It’s worth noting that although the clicking approach probably would have taken a few hours of mind-numbing work, figuring out how to scrape the dynamic page would likely have taken much longer.) Fortunately, I had a better idea that paid off and saved me much tedium.
On the Boulder Open Data page I found contact info for the person in charge of maintenance for the FAQ-related datasets.
After just a few emails she was able to send me a file from their webmaster that included the content of each FAQ page, labeled by Category, Department, and Topic!

This asset was pure gold in the early stages of the project, and it continued to pay off as my contact offered further help: suggesting similar datasets and related projects, and raising the possibility of presenting my chatbot to the Boulder Data Team and hosting it on their Data Showcase page.
So thanks a bunch, Nicolia!

Lesson learned: Before you sink a chunk of time into a menial scraping task, take a few minutes to look around and see if you can go straight to the source.
Many municipal governments publish public datasets, and someone has to be in charge of handling it all. Ask nicely and you never know what data you might turn up… and the social connection may be even more rewarding.
Data Cleaning

Getting this file was a huge boost, but the work wasn’t over yet.
After converting the file from Excel into CSV format, I still had work to do in cleaning up the data and getting it to the final state: a column of questions and a corresponding column of answers.
I converted each entry into unicode text, removed artifacts, and utilized a feature of the spaCy library to separate each sentence within the entry.
While this wasn’t a perfect fix, it got me a massive head start on working with clean data, enough so that I was able to continue experimenting with NLP and ML techniques.
Later on I returned to cleaning by picking over the entries and hand-editing some remaining errors caused by addresses, nonstandard grammar, odd punctuation, and styling.
For the pièce de résistance, I wrote a complicated regular expression to separate question-and-answer pairs from each block of text and put them into two separate columns.
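The original regular expression isn’t shown here, but the idea can be sketched with a simplified stand-in: split the text into sentences, treat any sentence ending in a question mark as the start of a new pair, and gather the sentences that follow it as the answer. (The FAQ text and helper name below are illustrative, not from the actual dataset.)

```python
import re

def split_qna(text):
    """Split a block of FAQ text into (question, answer) pairs.
    A sentence ending in '?' starts a new pair; everything after it
    (until the next question) becomes that question's answer."""
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())
    pairs = []
    for sent in sentences:
        if sent.endswith('?'):
            pairs.append([sent, []])   # new question, empty answer so far
        elif pairs:
            pairs[-1][1].append(sent)  # append to the current answer
    return [(q, ' '.join(a)) for q, a in pairs]

faq = ("How do I pay my water bill? Pay online or by mail. "
       "Where can I park downtown? Use the public garages. Meters are hourly.")
print(split_qna(faq))
```

A real FAQ will have messier cases (questions phrased as statements, multi-question entries), which is where the hand-editing described above comes in.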
Skip to the end: Depending on your needs, you may be able to stop here and take your tidy q-n-a spreadsheet to a chatbot platform like Microsoft QnA Maker or Google Dialogflow, which will plug your data into a prebuilt interface complete with chitchat responses and out-of-the-box Machine Learning.
Of these two options in default settings, QnA Maker gave me slightly better results.
Pricing will vary, so do your homework before you commit.
I provide some tips from my experience at the end of this blog.
Natural Language Processing

After preprocessing and hand-cleaning, my NLP pipeline removes punctuation and stop words, lowercases everything, and then lemmatizes the remaining words.
I found lemmatization to be significantly more effective than mere stemming.
The reduction of words to their most basic form contributes to broader similarity matching that helps with the chatbot’s accuracy rate in responding to queries.
The NLTK library has some easy options to help.
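The pipeline shape can be sketched as below. Note the tiny hand-rolled lemma table is only a stand-in for NLTK’s WordNetLemmatizer (which needs its corpus downloads); the stop-word set is likewise a toy subset of NLTK’s full list.

```python
import string

# Stand-in lemma table; in practice you'd use nltk.stem.WordNetLemmatizer,
# which covers the full vocabulary.
LEMMAS = {"bills": "bill", "paying": "pay", "paid": "pay",
          "licenses": "license", "licensing": "license"}
# Toy subset of a stop-word list (NLTK ships a full one).
STOP_WORDS = {"how", "do", "i", "my", "a", "the", "where", "can"}

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, lemmatize."""
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("How do I pay my water bills?"))  # ['pay', 'water', 'bill']
```

Because lemmatization maps “bills”, “paying”, and “paid” to their base forms, two differently worded queries about the same topic end up with overlapping tokens, which is exactly what helps the similarity matching later.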
The next step is to transform the documents into the numerical vectors that ML models understand.
To begin with, I used Bag-of-Words: a large, sparse matrix of binary values indicating whether each word from the corpus exists in a given document.
From here, the BOW vectors are transformed into Term Frequency-Inverse Document Frequency (TF-IDF) vectors, which represent how important each word is to the corpus based on its frequency within a document and its frequency across the entire collection of documents.
For my dataset, therefore, the word ‘Boulder’ has a relatively low TF-IDF value due to its prevalence throughout the documents.
For an FAQ about a different city, you would expect ‘Boulder’ to carry more weight.
(No pun intended?)

Tools of the Trade: Bag-of-Words and TF-IDF vectorization are pretty standard for any NLP project, and provide a solid base from which to work toward more complex actions.
Most of the following matching strategies involve applying various functions to the TF-IDF vectors.
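As a quick sketch of the vectorization step, scikit-learn’s TfidfVectorizer handles both the counting and the TF-IDF weighting. The three-document mini-corpus here is made up for illustration; since “boulder” appears in every document, it gets a lower weight than a rarer word like “trash”:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: 'boulder' appears in every document,
# so its TF-IDF weight should be comparatively low.
docs = ["boulder trash pickup schedule",
        "boulder water bill payment",
        "boulder downtown parking permits"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse (3 docs x vocab-size) matrix
vocab = vectorizer.vocabulary_           # word -> column index

row = tfidf[0].toarray()[0]              # first document's dense vector
print(row[vocab['boulder']] < row[vocab['trash']])  # True: common word weighs less
```

The resulting rows are the document vectors that the similarity strategies below operate on.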
Similarity Matching

For matching user questions to answers in the dataset, I experimented with several different approaches.

Semantic similarity is included in the NLP library spaCy via pre-trained models that consider words similar if they are used in similar contexts.
In my experience, the ‘medium’ sized spaCy model was nearly as good as the ‘large’ one at assigning similarity, and significantly quicker to load.
The implementation for this library was fast and easy (even vectorization is included in spaCy); however, the immediate results were not particularly impressive, and the large models might hurt your wallet when it comes to deployment.
Cosine similarity is a measurement of the cosine of the angle between two vectors.
The application here is to compute the cosine for the angle between the user query vector and each vector in the dataset.
The cosine closest to 1 is the best match for the user query.
I built this comparison from existing parts of numpy and scikit-learn, which was straightforward enough and has the benefit of not requiring a bulky model.
I found the results to be pretty decent and ended up using this lean approach in my MVP bot.
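Built from numpy parts, the cosine comparison is only a few lines. This is a minimal sketch (the function names and toy vectors are mine, not from the original code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def best_match(query_vec, doc_matrix):
    """Index of the document vector closest (by cosine) to the query."""
    sims = [cosine_sim(query_vec, row) for row in doc_matrix]
    return int(np.argmax(sims))

# Toy TF-IDF-like vectors: three documents, three vocabulary terms.
docs = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0],
                 [1.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.9])
print(best_match(query, docs))  # 0
```

In the bot, `doc_matrix` would be the TF-IDF matrix of FAQ questions and `query_vec` the vectorized user query; the returned index selects the corresponding answer.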
KD Tree is a data structure, implemented in scikit-learn, that is useful for storing K-Nearest Neighbor relationships.
This static storage saves you from having to recompute distances each time you compare vectors.
The distance metrics in scikit-learn’s version don’t support cosine similarity, which would be an ideal extension of that strategy, but some of the included metrics (euclidean, l2, minkowski, and p) worked just as well in my test cases.
Doc2Vec by Gensim is an extension of the Word2Vec approach that uses neural networks to learn embeddings for entire documents rather than just individual words.
I found the set-up for this method to be significantly more complicated than the previous ones, and the results to be inferior to basic cosine similarity and KDTree.
However, if you’re comfortable with neural networks and willing to dig into the details then this is a promising angle.
Something to note: most FAQs won’t contain enough text for robust training on their own, and there’s not a lot of support for pre-trained Doc2Vec models, but here’s a start.
Keep It Simple, Stupid: There are loads of different approaches for similarity matching.
Again, follow the path of least resistance and look for a library that has built-in functionality for the strategy you want to use.
There’s no need to reinvent the wheel.
When you get your bot ready for deployment you can come back and revisit your options to improve accuracy.
My advice is to build your pipeline as separate modules so that you can easily swap in new options without rewriting your whole program.
This modularity will also help you test and compare the outcomes so you know you’re making informed decisions about which function to choose.
Web Service and Deployment

With your pipeline able to return relevant responses, the next step is getting the bot online so people can actually use it.
Flask is a Python micro-framework that makes web service for your chatbot surprisingly simple: it hosts a basic HTML page on a local URL with a form that submits the user’s query as a POST request to a Flask route function, which returns the response from your similarity-matching algorithm and renders it on the page.
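That whole flow fits in one small file. This is a minimal sketch; `match_answer` is a hypothetical stand-in for the similarity-matching pipeline described above, and the HTML is deliberately bare-bones:

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

PAGE = """<form method="post">
  <input name="query"><input type="submit" value="Ask">
</form><p>{{ answer }}</p>"""

def match_answer(query):
    # Stand-in for the similarity-matching pipeline.
    if "trash" in query.lower():
        return "Trash is collected weekly."
    return "Sorry, I don't know about that yet."

@app.route("/", methods=["GET", "POST"])
def index():
    answer = ""
    if request.method == "POST":
        # The form's POST request lands here; run the matcher on the query.
        answer = match_answer(request.form.get("query", ""))
    return render_template_string(PAGE, answer=answer)

# app.run(debug=True)  # uncomment to serve on a local URL
```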
Sharing is Caring: The real power of web service is in putting your bot out on the internet where it can be accessed by the general public and by other programs.
Simply having the local Flask service gets you halfway there; the other half is making your bot publicly accessible.
Tools like ngrok will technically do this, but on a small scale that requires your own computer to do the hosting.
Much more useful are the modern cloud services offered by Google, Amazon, Microsoft, etc. that will take care of hosting for you (for a fee, of course).

I worked with Google Cloud Platform to host my Flask app by wrapping it in a Docker container and running that on a load-balanced Kubernetes cluster to allow for scaling.
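The container side can be sketched as a minimal Dockerfile; the file names (`app.py`, `requirements.txt`), port, and gunicorn server here are illustrative assumptions, not the actual project layout:

```dockerfile
# Minimal image for a Flask chatbot (names are illustrative)
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Cloud platforms expect the container to listen on a known port
EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "app:app"]
```

Once the image is built and pushed to a container registry, the Kubernetes cluster pulls it and handles replication and load balancing.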
(Google App Engine is an alternative that may prove easier for some cases.) This step finally gives the bot a public URL.
Share it with your friends!

Interface

If you want to build out your own web page interface, this is the place to do it.
I took a hybrid approach here, and uploaded my q-n-a dataset to Google Dialogflow as a KnowledgeBase intent.
Then I connected the Knowledge intent to my web service on GCP by using the app’s public URL as a webhook.
Now, whenever a user queries the Dialogflow agent about topics in the KnowledgeBase, the agent will send a webhook request to my web service app behind the scenes and receive the response determined by my custom similarity matching.
The benefit of my hybrid approach is that users can interact with the Dialogflow interface (including small talk responses) and Google’s Machine Learning for the Knowledge intent acts as a fallback response if my custom bot server is down.
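On the web service side, the webhook handler only needs to pull the user’s text out of Dialogflow’s request and send back a fulfillment response. A minimal sketch, with `match_answer` standing in for the custom similarity matcher (Dialogflow’s v2 fulfillment format carries the query in `queryResult.queryText` and expects `fulfillmentText` back):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def match_answer(query):
    # Stand-in for the custom similarity-matching pipeline.
    return "Trash is collected weekly."

@app.route("/webhook", methods=["POST"])
def webhook():
    # Dialogflow v2 sends the user's text in queryResult.queryText;
    # the reply goes back as fulfillmentText.
    body = request.get_json(force=True)
    query = body.get("queryResult", {}).get("queryText", "")
    return jsonify({"fulfillmentText": match_answer(query)})
```

Pointing the Dialogflow agent’s fulfillment webhook at this route’s public URL wires the two halves together.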
Proud Parent: I should point out that, as basic as my similarity algorithm is, it still outperformed the out-of-the-box ML of every big-tech platform I tested.
Some Words about the Cost

Google Cloud Products: The $300 sign-up credit is intended to help absorb costs while you’re getting the hang of things, but it still pays to be careful about how much load you’re putting on your services so that you avoid the classic horror story of skyrocketing costs.
The services I used required billing to be enabled.
I burned through about $15 in a couple of weeks as I tested out Docker, clusters, and swarms.
One day accounted for about $10 by itself, and it can be tricky to track where the money is going, so be careful! In the end, a chatbot with a small user base should be fairly inexpensive to run.
I’ll provide an update later on with more details about how long the credit lasted.
Dialogflow: The Standard Edition is free; the Enterprise Edition is pay-per-use.
Here’s a comparison chart, but even the Standard Edition quotas should be enough to get you started.
There are lots of little bells and whistles on this platform, including voice support and integrations with messaging apps like Facebook and Slack.
Microsoft QnA Maker: This web service on the Microsoft Azure platform works pretty well out-of-the-box if you just upload an Excel file with two columns: Questions and Answers.
For my small test set, the ML responses were more accurate than Dialogflow’s built-in ML, although the interface wasn’t as user-friendly.
I ran this for exactly one month continuously on the $200 free sign-up credit (with very little traffic), but I suggest finding a way to disable the service while you’re setting things up in order to stretch your credit further.
Here’s the pricing page for this service.
Next episode…

In my next post I’m going to dive further into the gritty details and provide a guide that traces the path to deploying this simple Flask app to a GCP Kubernetes cluster via Docker.
I plan to fill in some gaps I found in other tutorials, so stay tuned if you’re interested in this topic.
Let me know if you have any questions or just want to chat about bots!