Chatbots Aren’t as Difficult to Make as You Think

Let’s do some Python magic to make it responsive.

Making our Telegram Chatbot responsive

Create a file main.py and put the following code in it.

Don’t worry; most of the code here is boilerplate that makes our chatbot communicate with Telegram using the access token.

We only need to worry about implementing the class SimpleDialogueManager. This class contains a function called generate_answer, which is where we will write our bot logic.
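The embedded code isn’t reproduced here, but a minimal sketch of main.py might look like the following, assuming we long-poll the raw Telegram Bot API with requests (the token string is a placeholder, and the naive answers are illustrative):

```python
# A minimal sketch of main.py, assuming long polling against the raw
# Telegram Bot API via `requests`; the original boilerplate may differ.
import time
import requests

TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder: paste the token from BotFather
API_URL = "https://api.telegram.org/bot{}/".format(TOKEN)


class SimpleDialogueManager(object):
    """This is where the bot logic lives."""
    def generate_answer(self, question):
        # Naive placeholder logic; we replace this with real logic later.
        if "hi" in question.lower():
            return "Hello, You"
        return "Don't be rude. Say Hi first."


def get_updates(offset=None):
    # getUpdates is the Bot API's long-polling endpoint.
    params = {"timeout": 30, "offset": offset}
    return requests.get(API_URL + "getUpdates", params=params).json()["result"]


def send_message(chat_id, text):
    params = {"chat_id": chat_id, "text": text}
    requests.post(API_URL + "sendMessage", params=params)


def main():
    dialogue_manager = SimpleDialogueManager()
    offset = None
    while True:
        for update in get_updates(offset):
            print("Update content: {}".format(update))
            offset = update["update_id"] + 1
            message = update.get("message")
            if message and "text" in message:
                answer = dialogue_manager.generate_answer(message["text"])
                send_message(message["chat"]["id"], answer)
        time.sleep(1)


if __name__ == "__main__":
    main()
```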

Simple main.py code

Now you can run main.py in a terminal window to make your bot responsive:

$ python main.py

A very naive chatbot

Nice.

It follows simple logic, but the good thing is that our bot now does something. Congratulate yourself a little bit if you have reached this point. What we have accomplished here is not generic.

Also, take a look at the terminal window where we ran the main.py file. Whenever a user asks a question, we get a dictionary like the one below, containing the unique chat ID, the chat text, user information, and so on, which we can use as per our requirements later.

Update content: {'update_id': 484689748, 'message': {'message_id': 115, 'from': {'id': 844474950, 'is_bot': False, 'first_name': 'Rahul', 'last_name': 'Agarwal', 'language_code': 'en'}, 'chat': {'id': 844474950, 'first_name': 'Rahul', 'last_name': 'Agarwal', 'type': 'private'}, 'date': 1555266010, 'text': 'What is 2+2'}}

Until now, everything we have done has been setup and engineering work.

Now, if we write some sound data science logic into the generate_answer function in main.py, we should have a decent chatbot.

2. ChatterBot

From the documentation:

ChatterBot is a Python library that makes it easy to generate automated responses to a user’s input.

ChatterBot uses a selection of machine learning algorithms to produce different types of reactions.

This makes it easy for developers to create chat bots and automate conversations with users.

Simply put, it is a black-box system that can provide responses to chitchat-type questions for our chatbot. And the best part is that it is pretty easy to integrate into our current flow.

We could also have trained a Seq2Seq model to do the same thing; maybe I will do that in a later post. I digress.

So, let’s install it with:

$ pip install chatterbot

Then change the SimpleDialogueManager class in main.py to the following, and we have a bot that can talk to the user and answer random queries.

```python
class SimpleDialogueManager(object):
    """
    This is a simple dialogue manager to test the telegram bot.
    The main part of our bot will be written here.
    """
    def __init__(self):
        from chatterbot import ChatBot
        from chatterbot.trainers import ChatterBotCorpusTrainer
        chatbot = ChatBot('MLWhizChatterbot')
        trainer = ChatterBotCorpusTrainer(chatbot)
        trainer.train('chatterbot.corpus.english')
        self.chitchat_bot = chatbot

    def generate_answer(self, question):
        response = self.chitchat_bot.get_response(question)
        return response
```

The code in __init__ instantiates a chatbot using ChatterBot and trains it on the provided English corpus data.

The data is pretty small, but you can always train it on your own dataset too; just see the documentation.
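For instance, a small sketch of training on your own question-answer pairs with ChatterBot’s ListTrainer (the conversation below is made up for illustration) could look like:

```python
# Sketch: training ChatterBot on custom pairs with ListTrainer.
from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer

chatbot = ChatBot('MLWhizChatterbot')
trainer = ListTrainer(chatbot)
trainer.train([
    "What is your favourite language?",
    "Python, obviously.",
])
```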

We can then generate responses using the ChatterBot instance in the generate_answer function. Not too “ba a a a a a d”, I must say.

Creating our StackOverflow Chatbot

OK, so we are finally at a stage where we can do something we love: use data science to power our application/chatbot. Let us start by creating a rough architecture of what we are going to do next.

The architecture of our StackOverflow chatbot

We will need to create two classifiers and save them as .pkl files.

Intent classifier: This classifier will predict whether a question is a Stack Overflow question or not. If it is not, we let ChatterBot handle it.

Programming-language (tag) classifier: This classifier will predict which language a question belongs to, if the question is a Stack Overflow question. We do this so that we only have to search among questions with that language tag in our database.

To keep it simple, we will create plain TF-IDF models, and we will need to save these TF-IDF vectorizers. We will also need to store word vectors for every question for similarity calculations later.

Let us go through the process step by step.

You can get the full code in this Jupyter notebook in my project repository.

Step 1. Reading and visualizing the data

```python
dialogues = pd.read_csv("data/dialogues.tsv", sep="\t")
posts = pd.read_csv("data/tagged_posts.tsv", sep="\t")
dialogues.head()
```

Dialogues data

```python
posts.head()
```

StackOverflow posts data

```python
print("Num Posts:", len(posts))
print("Num Dialogues:", len(dialogues))
```

Num Posts: 2171575
Num Dialogues: 218609

Step 2: Create training data for the intent classifier — chitchat vs. StackOverflow question

We will be creating a TF-IDF model with logistic regression to do this.

If you want to know more about the TF-IDF model, you can read about it here. We could also have used deep learning models or transfer learning approaches, but since the main objective of this post is to get a chatbot up and running, and not to worry too much about accuracy, we will work with the TF-IDF based model only.
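As a quick, self-contained illustration of what TF-IDF produces (a toy corpus, not the project data):

```python
# Toy TF-IDF illustration with scikit-learn; the two documents are made up.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["how to sort a list in python", "hi how are you doing"]
tfv = TfidfVectorizer()
X = tfv.fit_transform(docs)

print(tfv.vocabulary_)       # term -> column index
print(X.toarray().round(2))  # one TF-IDF weighted row per document
```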

```python
texts = list(dialogues[:200000].text.values) + list(posts[:200000].title.values)
labels = ['dialogue']*200000 + ['stackoverflow']*200000
data = pd.DataFrame({'text': texts, 'target': labels})

def text_prepare(text):
    """Performs tokenization and simple preprocessing."""
    replace_by_space_re = re.compile('[/(){}\[\]\|@,;]')
    bad_symbols_re = re.compile('[^0-9a-z #+_]')
    stopwords_set = set(stopwords.words('english'))

    text = text.lower()
    text = replace_by_space_re.sub(' ', text)
    text = bad_symbols_re.sub('', text)
    text = ' '.join([x for x in text.split() if x and x not in stopwords_set])
    return text.strip()

# Doing some data cleaning
data['text'] = data['text'].apply(lambda x: text_prepare(x))

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'], test_size=.1, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))
```

Train size = 360000, test size = 40000

Step 3. Create the intent classifier

Here we fit a TF-IDF vectorizer to create features and train a logistic regression model to build the intent classifier.

Please note how we save the TF-IDF vectorizer to resources/tfidf.pkl and the intent classifier to resources/intent_clf.pkl. We will need these files when we write the SimpleDialogueManager class for our final chatbot.

```python
# We will keep our models and vectorizers in this folder
!mkdir resources

def tfidf_features(X_train, X_test, vectorizer_path):
    """Performs TF-IDF transformation and dumps the model."""
    tfv = TfidfVectorizer(dtype=np.float32, min_df=3, max_features=None,
                          strip_accents='unicode', analyzer='word',
                          token_pattern=r'\w{1,}', ngram_range=(1, 3),
                          use_idf=1, smooth_idf=1, sublinear_tf=1,
                          stop_words='english')
    X_train = tfv.fit_transform(X_train)
    X_test = tfv.transform(X_test)
    pickle.dump(tfv, vectorizer_path)
    return X_train, X_test

X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, open("resources/tfidf.pkl", 'wb'))

intent_recognizer = LogisticRegression(C=10, random_state=0)
intent_recognizer.fit(X_train_tfidf, y_train)
pickle.dump(intent_recognizer, open("resources/intent_clf.pkl", 'wb'))

# Check test accuracy.
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))
```

Test accuracy = 0.989825

The intent classifier has a pretty good test accuracy of about 99%.

TF-IDF is not so bad.
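As a quick sanity check (the question below is made up; it reuses text_prepare and the pickles saved above), we would expect something like:

```python
# Sketch: classify a made-up question with the saved vectorizer and classifier.
vectorizer = pickle.load(open("resources/tfidf.pkl", 'rb'))
clf = pickle.load(open("resources/intent_clf.pkl", 'rb'))

features = vectorizer.transform([text_prepare("How do I sort a list in Python?")])
print(clf.predict(features))  # should print ['stackoverflow']
```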

Step 4: Create the programming-language classifier

Let us first create the data for the programming-language classifier and then train a logistic regression model using TF-IDF features. We save this tag classifier at resources/tag_clf.pkl. We do this step mostly because we don’t want to do similarity calculations over the whole database of questions, but only over the subset of questions with the predicted language tag.

```python
# Creating the data for the programming-language classifier
X = posts['title'].values
y = posts['tag'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))
```

Train size = 1737260, test size = 434315

```python
vectorizer = pickle.load(open("resources/tfidf.pkl", 'rb'))
X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

tag_classifier = OneVsRestClassifier(LogisticRegression(C=5, random_state=0))
tag_classifier.fit(X_train_tfidf, y_train)
pickle.dump(tag_classifier, open("resources/tag_clf.pkl", 'wb'))

# Check test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))
```

Test accuracy = 0.8043816124241622

Not bad again.

Step 5: Store question database embeddings

One can use pre-trained word vectors from Google, or get better results by training embeddings on one’s own data. Since accuracy and precision are again not the primary goal of this post, we will use the pretrained vectors.

```python
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
```

We want to convert every question to an embedding and store them, so that we don’t have to calculate the embeddings for the whole dataset every time.

In essence, whenever the user asks a Stack Overflow question, we want to use some distance similarity measure to get the most similar question.
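The distance measure we use below is scikit-learn’s pairwise_distances_argmin, which returns, for each query vector, the index of the closest candidate. A toy illustration with made-up vectors:

```python
# Toy illustration of pairwise_distances_argmin; the vectors are made up.
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

query = np.array([[1.0, 0.0]])
candidates = np.array([[0.9, 0.1],   # close to the query
                       [0.0, 1.0]])  # far from the query
print(pairwise_distances_argmin(query, candidates))  # -> [0]
```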

```python
def question_to_vec(question, embeddings, dim=300):
    """
    question: a string
    embeddings: dict where the key is a word and the value is its embedding
    dim: size of the representation

    result: vector representation for the question
    """
    word_tokens = question.split(" ")
    question_len = len(word_tokens)
    question_mat = np.zeros((question_len, dim), dtype=np.float32)

    for idx, word in enumerate(word_tokens):
        if word in embeddings:
            question_mat[idx, :] = embeddings[word]

    # Remove zero-rows which stand for OOV words
    question_mat = question_mat[~np.all(question_mat == 0, axis=1)]

    # Compute the mean of each word along the sentence
    if question_mat.shape[0] > 0:
        vec = np.array(np.mean(question_mat, axis=0), dtype=np.float32).reshape((1, dim))
    else:
        vec = np.zeros((1, dim), dtype=np.float32)
    return vec

counts_by_tag = posts.groupby(by=['tag'])["tag"].count().reset_index(name='count').sort_values(['count'], ascending=False)
counts_by_tag = list(zip(counts_by_tag['tag'], counts_by_tag['count']))
print(counts_by_tag)
```

[('c#', 394451), ('java', 383456), ('javascript', 375867), ('php', 321752), ('c_cpp', 281300), ('python', 208607), ('ruby', 99930), ('r', 36359), ('vb', 35044), ('swift', 34809)]

We save the embeddings in a folder aptly named resources/embeddings_folder.

This folder will contain a .pkl file for every tag. For example, one of the files will be python.pkl.

```python
!mkdir resources/embeddings_folder

for tag, count in counts_by_tag:
    tag_posts = posts[posts['tag'] == tag]
    tag_post_ids = tag_posts['post_id'].values
    tag_vectors = np.zeros((count, 300), dtype=np.float32)
    for i, title in enumerate(tag_posts['title']):
        tag_vectors[i, :] = question_to_vec(title, model, 300)
    # Dump post ids and vectors to a file.
    filename = 'resources/embeddings_folder/' + tag + '.pkl'
    pickle.dump((tag_post_ids, tag_vectors), open(filename, 'wb'))
```

We are nearing the end now.

We need a function that, given a question and its programming language, returns the post ID of the most similar question in the dataset.

Here it is:

```python
def get_similar_question(question, tag):
    # Get the path where all question embeddings are kept and load the post ids and embeddings
    embeddings_path = 'resources/embeddings_folder/' + tag + ".pkl"
    post_ids, post_embeddings = pickle.load(open(embeddings_path, 'rb'))
    # Get the embedding for the question
    question_vec = question_to_vec(question, model, 300)
    # Find the index of the most similar post
    best_post_index = pairwise_distances_argmin(question_vec, post_embeddings)
    # Return the best post id
    return post_ids[best_post_index]

get_similar_question("how to use list comprehension in python?", 'python')
```

array([5947137])

We can use this post ID to find the question at https://stackoverflow.com/questions/5947137. The question the similarity checker suggested has the actual text: “How can I use a list comprehension to extend a list in python? [duplicate]”. Not too bad, but we could have done better if we trained our own embeddings or used StarSpace embeddings.

Assemble the Puzzle — the SimpleDialogueManager Class

Finally, we have reached the end of the whole exercise. Now we have to fit all the pieces of the puzzle into our SimpleDialogueManager class. Here is the code for that. Go through the comments to understand how the pieces fit together into one cohesive piece of logic. Look in particular at the initialization and the generate_answer function.
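The embedded code isn’t reproduced here, but a rough sketch of how the pieces could fit together follows. It assumes the resources saved above, plus the text_prepare and question_to_vec helpers defined earlier; the real class in the repository may differ in details.

```python
# A rough sketch of the final SimpleDialogueManager, assuming the pickled
# resources from the steps above and the text_prepare / question_to_vec
# helpers defined earlier; the version in the repository may differ.
import pickle
import gensim
from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer
from sklearn.metrics import pairwise_distances_argmin


class SimpleDialogueManager(object):
    def __init__(self):
        # Chitchat bot for everything that is not a Stack Overflow question.
        chatbot = ChatBot('MLWhizChatterbot')
        trainer = ChatterBotCorpusTrainer(chatbot)
        trainer.train('chatterbot.corpus.english')
        self.chitchat_bot = chatbot

        # Pre-trained word vectors and the pickled models saved above.
        self.model = gensim.models.KeyedVectors.load_word2vec_format(
            'GoogleNews-vectors-negative300.bin', binary=True)
        self.intent_recognizer = pickle.load(open('resources/intent_clf.pkl', 'rb'))
        self.tag_classifier = pickle.load(open('resources/tag_clf.pkl', 'rb'))
        self.tfidf_vectorizer = pickle.load(open('resources/tfidf.pkl', 'rb'))

    def get_similar_question(self, question, tag):
        # Load post ids and embeddings for this tag, then find the closest post.
        embeddings_path = 'resources/embeddings_folder/' + tag + '.pkl'
        post_ids, post_embeddings = pickle.load(open(embeddings_path, 'rb'))
        question_vec = question_to_vec(question, self.model, 300)
        best_post_index = pairwise_distances_argmin(question_vec, post_embeddings)
        return post_ids[best_post_index]

    def generate_answer(self, question):
        prepared_question = text_prepare(question)
        features = self.tfidf_vectorizer.transform([prepared_question])

        # Route the question: chitchat goes to ChatterBot,
        # Stack Overflow questions go to the similarity search.
        intent = self.intent_recognizer.predict(features)[0]
        if intent == 'dialogue':
            return self.chitchat_bot.get_response(question)

        tag = self.tag_classifier.predict(features)[0]
        post_id = self.get_similar_question(question, tag)[0]
        return ('I think it is about ' + tag +
                '\nThis thread might help you: '
                'https://stackoverflow.com/questions/' + str(post_id))
```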

Click for the code of the whole main.py for you to use and see.

Just run the whole main.py using:

$ python main.py

And we will have our bot up and running if it is able to access all the resources.

Yay! Again, here is the link to the GitHub repository.

The possibilities are really endless

This is just a small demo project of what you can do with chatbots.

You can do a whole lot more once you realize that the backend is just Python.

One idea is to run a chatbot script on all the servers I manage, so that I can run system commands straight from Telegram. You can use os.system to run any system command. Bye-bye, SSH.
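As a hedged sketch of that idea (the keyword-to-command mapping below is made up, and a whitelisted subprocess call is safer than raw os.system):

```python
# Sketch: map made-up chat keywords to a whitelist of system commands.
import subprocess

COMMANDS = {
    'disk': ['df', '-h'],      # disk usage
    'uptime': ['uptime'],      # server uptime
    'memory': ['free', '-m'],  # memory usage (Linux)
}

def run_server_command(question):
    q = question.lower()
    for keyword, cmd in COMMANDS.items():
        if keyword in q:
            return subprocess.check_output(cmd).decode()
    return "I don't know that command yet."
```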

You can also make chatbots do daily tasks using simple keyword-based intents; it is just simple logic. Find out the weather, fetch cricket scores, or look up newly released movies. Whatever floats your boat.

Or maybe try to integrate a Telegram-based chatbot into your website. See livechatbot. Or maybe just try to have fun with it.

Conclusion

Here we learned how to create a simple chatbot.

And it works okay.

We can improve this chatbot a whole lot by increasing classifier accuracy, handling edge cases, making it respond faster, using better similarity measures/embeddings, or adding more logic to handle more use cases. But the fact remains the same: the AI in chatbots is just simple human logic, not magic.

In this post, I closely followed one of the projects from this course to create this chatbot. Do check out the course if you get confused, or tell me about your problems in the comments; I will certainly try to help.

Follow me on Medium or subscribe to my blog to be informed about my next posts. Till then, ciao!
