")feature_result_tgt = nfeature_accuracy_checker(vectorizer=tfidf,ngram_range=(1, 3))Before we are done here, we should check the classification report.
from sklearn.
metrics import classification_reportcv = CountVectorizer(max_features=30000,ngram_range=(1, 3))pipeline = Pipeline([ ('vectorizer', cv), ('classifier', rf) ])sentiment_fit = pipeline.
fit(X_train, y_train)y_pred = sentiment_fit.
predict(X_test)print(classification_report(y_test, y_pred, target_names=['negative','positive']))classification_reportChi-Squared for Feature SelectionFeature selection is an important problem in Machine learning.
I will show how straightforward it is to conduct chi-squared-based feature selection on our large-scale data set. We will calculate the chi-squared scores for all the features and visualize the top 20; here the terms (words or n-grams) are the features, and positive and negative are the two classes. Given a feature X, we can use the chi-squared test to evaluate its importance for distinguishing between the classes.
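Intuitively, for each term the test compares the observed co-occurrence counts of term and class with the counts that would be expected if the two were independent, chi2 = sum((observed - expected)^2 / expected) over the term/class contingency table; the larger the score, the more strongly the term's presence depends on the class, and the more useful it is for telling positive reviews from negative ones.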
from sklearn.feature_selection import chi2

tfidf = TfidfVectorizer(max_features=30000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(result.Reviews)
y = result.Positivity
chi2score = chi2(X_tfidf, y)[0]

plt.figure(figsize=(16, 8))
scores = list(zip(tfidf.get_feature_names(), chi2score))
chi2 = sorted(scores, key=lambda x: x[1])
topchi2 = list(zip(*chi2[-20:]))
x = range(len(topchi2[1]))
labels = topchi2[0]
plt.barh(x, topchi2[1], align='center', alpha=0.5)
plt.plot(topchi2[1], x, '-o', markersize=5, alpha=0.8)
plt.yticks(x, labels)
plt.xlabel('$chi^2$')
plt.show();
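The scores above are only visualized; if we actually wanted to keep just the highest-scoring n-grams, scikit-learn's SelectKBest can do the selection for us. A minimal sketch, with the value of k an arbitrary assumption rather than something from the article:

# Sketch (not part of the original pipeline): keep the k highest-scoring n-grams.
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=10000)     # k chosen arbitrarily for illustration
X_selected = selector.fit_transform(X_tfidf, y)
print(X_selected.shape)                   # (n_reviews, 10000)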
LSTM Framework

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re

Pad sequences

In order to feed this data into our RNN, all input documents must have the same length.
We will limit the maximum review length to max_words by truncating longer reviews and padding shorter reviews with a null value (0).
We can accomplish this using the pad_sequences() function in Keras.
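To make the behaviour concrete, here is a toy call with made-up sequences (not from our data); by default pad_sequences pads and truncates at the front of each sequence.

# Toy illustration only: pad/truncate three short sequences to length 4.
from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], maxlen=4))
# [[ 0  1  2  3]
#  [ 0  0  4  5]
#  [ 7  8  9 10]]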
Next, I define the maximum number of features as 30,000 and use Tokenizer to vectorize the text and convert it into sequences that the network can take as input.
max_features = 30000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(result['Reviews'].values)
X1 = tokenizer.texts_to_sequences(result['Reviews'].values)
X1 = pad_sequences(X1)
Y1 = pd.get_dummies(result['Positivity']).values
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, random_state=42)
print(X1_train.shape, Y1_train.shape)
print(X1_test.shape, Y1_test.shape)

Design an RNN model for sentiment analysis

We start building our model architecture in the code cell below.
We have imported some layers from Keras that you might need but feel free to use any other layers / transformations you like.
Remember that our input is a sequence of words (technically, integer word IDs) padded to a common maximum length, and our output is a binary sentiment label (0 or 1).
Keras Embedding Layer

Keras offers an Embedding layer that can be used for neural networks on text data.
It requires that the input data be integer encoded, so that each word is represented by a unique integer.
This data preparation step can be performed using the Tokenizer API also provided with Keras.
The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.
It is a flexible layer that can be used in a variety of ways, such as:
- It can be used alone to learn a word embedding that can be saved and used in another model later.
- It can be used as part of a deep learning model where the embedding is learned along with the model itself.
- It can be used to load a pre-trained word embedding model, a type of transfer learning.
The Embedding layer is defined as the first hidden layer of a network.
It must specify 3 arguments:
- input_dim: the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0 and 10, then the size of the vocabulary would be 11 words.
- output_dim: the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
- input_length: the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.

e = Embedding(200, 32, input_length=50)

The Embedding layer has weights that are learned.
If you save your model to file, this will include weights for the Embedding layer.
The output of the Embedding layer is a 2D array with one embedding vector for each word in the input sequence (input document).
If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
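For instance, a tiny model of that shape might look like the following; this is a sketch of the Embedding-Flatten-Dense pattern, not part of the article's own network.

# Sketch: connect an Embedding layer to a Dense layer through Flatten.
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

toy = Sequential()
toy.add(Embedding(200, 32, input_length=50))   # per-sample output shape: (50, 32)
toy.add(Flatten())                             # per-sample output shape: (1600,)
toy.add(Dense(1, activation='sigmoid'))
toy.summary()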
embed_dim = 150
lstm_out = 200

model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=X1.shape[1]))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

To summarize, our model is a simple RNN with one Embedding layer, one LSTM layer and one Dense layer.
4,781,202 parameters in total need to be trained.
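As a quick sanity check on that figure, using the sizes defined above: the Embedding layer has 30,000 × 150 = 4,500,000 weights, the LSTM has 4 × 200 × (150 + 200 + 1) = 280,800, and the Dense layer has 200 × 2 + 2 = 402, which indeed sums to 4,781,202.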
Train and evaluate our model

We first need to compile our model by specifying the loss function and optimizer we want to use while training, as well as any evaluation metrics we'd like to measure; here we track 'accuracy'. Once compiled, we can kick off the training process. There are two important training parameters that we have to specify, batch size and the number of training epochs, which together with our model architecture determine the total training time.

batch_size = 32
model.fit(X1_train, Y1_train, epochs=3, batch_size=batch_size, verbose=2)
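Three epochs keep the training time short; if we trained for longer, overfitting would become a concern. A common guard, shown here only as a sketch and not part of the original pipeline, is to hold out a validation split and stop once validation loss stops improving.

# Sketch, not from the original article: early stopping on a validation split.
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=1)
model.fit(X1_train, Y1_train,
          epochs=10, batch_size=batch_size,
          validation_split=0.1, callbacks=[early_stop], verbose=2)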
score, acc = model.evaluate(X1_test, Y1_test, verbose=2, batch_size=batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.51
acc: 0.84

Finally, we measure the number of correct guesses for each class.
It is clear that the network does very well at recognizing positive reviews, but it has a much harder time deciding when a review is negative.
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X1_test)):
    result = model.predict(X1_test[x].reshape(1, X1_test.shape[1]), batch_size=1, verbose=2)[0]
    if np.argmax(result) == np.argmax(Y1_test[x]):
        if np.argmax(Y1_test[x]) == 0:
            neg_correct += 1
        else:
            pos_correct += 1
    if np.argmax(Y1_test[x]) == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1

print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")

pos_acc 90.67439409905164 %
neg_acc 63.2890365448505 %
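The same per-class breakdown can be read off a confusion matrix in one call; a short sketch using scikit-learn (not part of the original article), undoing the one-hot labels with argmax:

# Sketch: per-class results as a confusion matrix.
from sklearn.metrics import confusion_matrix
import numpy as np

y_true = np.argmax(Y1_test, axis=1)
y_hat = np.argmax(model.predict(X1_test, batch_size=batch_size), axis=1)
print(confusion_matrix(y_true, y_hat))  # rows = true class, columns = predicted class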
Summary

There are several ways in which we can build our model.
We can keep trying to improve the accuracy of our model by experimenting with different architectures, layers and parameters. How good can we get without taking prohibitively long to train? How do we prevent overfitting? That's it for today.
Source code can be found on GitHub.
I am happy to hear any questions or feedback.
Connect with me on LinkedIn.