Boosting, Bagging and Stacking — A Comparative Analysis (2019 India Elections Case Study)

If you are a newcomer to this world, I have provided links throughout the article to help you out.

This blog is structured as follows:

1. Describe common Machine Learning ensemble methods: boosting, bagging and stacking.

Bagging vs Boosting (Source: https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/)

2. Build a stacking ensemble with Logistic Regression + Bagging + Boosting + SVM.

3. Analyze the above ML models for the most prominent parties in the elections.

4. Compare precision-recall for the prediction models, which reflects the general sentiment of the nation towards both parties.

5. Evaluate and interpret the accuracies, and compare them with other text classification techniques like FastText and NLTK-based classifiers, which were modeled in my previous blogs.

Pre-Processing and Text Classification

The step-by-step process of text cleaning and preprocessing after crawling Twitter data has been explained in my first post and is available on GitHub.

After the tweets have been preprocessed and labelled with different moods, they are fed into the text classification engine for mood prediction.

The following preprocessing steps have been undertaken before feeding the training and test dataset to the classification engine.

1. Eliminating stop words.

2. Stemming and lemmatizing to restrict the feature space of words.

3. Removing low-frequency words with TF-IDF.

4. Diversifying the training corpus with tweets from various parties over 2–3 months.

For more details on some of the concepts related to removing stop-words, stemming and lemmatizing, please go through this article “Ultimate guide to deal with text data”.

After preprocessing the tweets (Steps 1 and 2), the entire tweet document for a selected party needs to be represented in a portable format to ease interpretability across machines.

For this, we use scikit-learn's most popular text feature extraction utilities, CountVectorizer and TF-IDF, to evaluate word/n-gram frequencies, selecting or dropping certain words based on their rareness or high rate of occurrence.

Sentiment Analysis with CountVectorizer and TF-IDF

CountVectorizer: It gives a matrix representation of the frequency counts of each term in a specific document by tokenizing the text over the entire document.

TF-IDF: It represents the importance of a word to a document in a corpus.

The TF-IDF value increases with the frequency of a word in a document and is offset by the number of documents in the corpus that contain the word.

Term Frequency (TF) is computed as TF(t) = (number of times term t appears in a document) / (total number of terms in the document), while IDF, which can be computed for words, characters and documents, is given by IDF(t) = log_e(total number of documents / number of documents with term t in it).

For example, for a set of n-grams from the tweets:

TF from tweets = {'plz note': 89, 'modi god india': 66, 'keep going': 53, 'work done': 135, 'pm shri': 96, 'people should': 84, 'man developing india': 59, 'surgical strike': 116}

IDF from tweets = [6.42824854, 6.42824854, 5.86863275, 5.86863275]

Pipeline: A model chain built by composing feature extraction (CountVectorizer, TF-IDF) and classification techniques (like boosting, bagging or stacking) together.
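As a rough illustration of such a model chain (a minimal sketch; the variable names train_tweets, train_moods and test_tweets are placeholders rather than names from the election repository), a pipeline can be composed and fit like this:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Chain feature extraction (raw counts -> TF-IDF weights) with a classifier.
mood_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=100))])

mood_clf.fit(train_tweets, train_moods)          # lists of tweet strings and mood labels
predicted_moods = mood_clf.predict(test_tweets)  # predicted mood per test tweet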

The three ensemble approaches above are applied to tweets about the most prominent parties in the 2019 elections over a period of 2.5 months, and the results are compared.

The code snippet below configures the word-frequency parameters over the tweet history with CountVectorizer and TfidfVectorizer.

CountVectorizer params:

min_df = 2 : drop terms that appear in fewer than 2 documents.

max_df = 100 : drop terms that appear in more than 100 documents.

TfidfVectorizer params:

ngram_range = (min_n, max_n) : the lower and upper boundary of the range of n-values for the n-grams to be extracted, with min_n <= n <= max_n.

max_features : keep only the top max_features terms, ordered by term frequency across the tweet history.

analyzer : the unit used to compose the feature (word or character n-grams).

token_pattern : regular expression selecting tokens of 2 or more alphanumeric characters.
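In case the embedded snippet does not render, a hedged sketch of the vectorizer setup with the parameters listed above could look as follows (the max_features value of 5000 and the variable name tweets are illustrative assumptions, not taken from the original code):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Drop terms seen in fewer than 2 or more than 100 documents.
count_vect = CountVectorizer(min_df=2, max_df=100)

# Word unigrams and bigrams, top terms by frequency, tokens of 2+ alphanumeric characters.
tfidf_vect = TfidfVectorizer(ngram_range=(1, 2),
                             max_features=5000,
                             analyzer='word',
                             token_pattern=r'\w{2,}')

X_counts = count_vect.fit_transform(tweets)  # `tweets` is the preprocessed tweet list
X_tfidf = tfidf_vect.fit_transform(tweets)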

Source for Plotting Multi-label Classes for Bagging/Boosting/Stacking Classifiers

Bagging Algorithms

Bagging, also known as Bootstrap Aggregation, is one of the methods used for predictive modeling (CART) that produces a final model by averaging several models trained on random subsets of data drawn from the original training dataset with replacement.

One popular way of building bagging models is by combining several low-bias decision trees, which improves the model's predictive performance compared to the individual decision trees.

Averaging ensembles built with bagging techniques, like RandomForestClassifier and ExtraTreesClassifier, aim to reduce the variance and increase the model's robustness with respect to small changes in the data.

As decision trees form an integral part of the individual models used in bagging, let's first understand the concept of decision trees before going into the details of the bagging algorithms.

DecisionTreeClassifier

A decision tree classifier is built of decision nodes and leaf nodes: decision nodes validate feature values and create branches, while leaf nodes are assigned labels.

Decision trees grow by selecting the best decision stumps during the classification process.

The best decision stump can be chosen by computing the information gain, or the change in entropy, of the split.

Entropy estimates the degree of disorganization of the input values, based on whether the same or different labels are assigned to them; information gain measures how much a decision stump reduces this entropy.

Leaf nodes whose error exceeds the acceptable margin are replaced by new decision stumps trained on the subset of the training data that is selected by the path from the root of the tree to that leaf.
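To make the entropy and information-gain computation concrete, here is a small illustrative helper (not part of the original code) that scores a candidate split of the mood labels:

import math
from collections import Counter

def entropy(labels):
    # Degree of disorganization of a set of labels (0 when all labels agree).
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting `parent` into `left` and `right`.
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# A decision stump that separates the labels perfectly has information gain 1.0 here.
print(information_gain(['joy', 'joy', 'anger', 'anger'], ['joy', 'joy'], ['anger', 'anger']))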

For more details on decision trees, please browse through the following articles.

Decision Tree, Source: https://www.displayr.com/how-is-splitting-decided-for-decision-trees/

ExtraTreesClassifier represents an extremely randomized version of DecisionTreeClassifier that works on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.

Extra trees yield higher performance in the presence of noisy features.

RandomForestClassifier is a non-parametric algorithm that uses the bagging technique, aggregating the results from an ensemble of estimators (decision trees).

RandomForestClassifier works on the principle of a majority vote among a number of estimators, which can yield better results than any individual estimator.

It selects a set of features and observations from the training set with replacement and iteratively evaluates the best split point for a single tree.

Once the best split point is inferred for a single tree, the tree is grown to the fullest.

Several such decision trees are constructed, and the mode of their predictions is taken as the final predicted result.

Random Forest, Source: https://support.bccvl.org.au/support/solutions/articles/6000083217-random-forest

Precision-Recall Curves for Bagging

The precision-recall curves for the different classifiers (for the different predicted moods: anger, arousal, dominance, faith, fear, joy, neutral, sadness) are plotted below by computing the average precision and recall of the selected classes.

The precision-recall plot uses recall on the x-axis and precision on the y-axis, and the point of intersection (as given by the metric in the legend) gives the computed value of the precision-recall metric for the overall class.
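The plotting source is linked above; as a hedged sketch (assuming a fitted classifier clf that exposes predict_proba, plus held-out X_test and y_test), per-class precision-recall curves with their average precision can be produced roughly like this:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.preprocessing import label_binarize

y_test_bin = label_binarize(y_test, classes=clf.classes_)  # one indicator column per mood
y_score = clf.predict_proba(X_test)                        # per-mood probabilities

for i, mood in enumerate(clf.classes_):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    ap = average_precision_score(y_test_bin[:, i], y_score[:, i])
    plt.plot(recall, precision, label='%s (AP = %0.2f)' % (mood, ap))

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()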

Bagging Classifier : BJP Sentiment prediction
Bagging Classifier : Congress Sentiment prediction
Bagging Classifier with RandomForest : BJP Sentiment prediction
Bagging Classifier with RandomForest : Congress Sentiment prediction
DecisionTree Classifier : BJP Sentiment prediction
DecisionTree Classifier : Congress Sentiment prediction
ExtraTrees Classifier : BJP Sentiment prediction
ExtraTrees Classifier : Congress Sentiment prediction
RandomForest Classifier : BJP Sentiment prediction
RandomForest Classifier : Congress Sentiment prediction

Bagging Analysis Source code with CountVectorizer, TF-IDF (Term Frequency–Inverse Document Frequency), and both CountVectorizer and TF-IDF in a Pipeline

Interpretation of Bagging Results

CountVectorizer, and the Pipeline method using CountVectorizer and TfidfTransformer, perform better than TfidfTransformer alone.

Bagging with RandomForest has the highest accuracy of 50.6%, compared to ExtraTrees with 49.3%, for Congress sentiment prediction.

ExtraTrees has the highest accuracy of 55.2%, compared to RandomForest with 51%, for BJP sentiment prediction.

DecisionTree performs the worst with an average of 45% accuracy, whereas BaggingClassifier with RandomForest as base estimator performs slightly better at 51% accuracy for BJP sentiment prediction.

The plain BaggingClassifier performs the worst with an average of 50% accuracy, whereas RandomForest comes in close with 49.3% accuracy for Congress sentiment prediction.

Moods labelled Dominance, Faith, Fear, Anger, Arousal and Joy work best for Congress in the predicted models, with F1 scores of 0.6, 0.6, 0.6, 0.5, 0.5 and 0.5 respectively in the Bagging Classifier with RandomForest as base estimator.

Moods labelled Dominance, Faith, Fear and Arousal work best for BJP in the predicted models, with F1 scores of 0.6, 0.5, 0.5 and 0.5 respectively in the ExtraTrees Classifier.

Less prominent labelled moods like Neutral and Sadness have very low precision, recall and F1 scores, as they have very few instances.
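The bagging analysis source is linked above; a minimal sketch of the five bagging-family models compared here, each wrapped in the CountVectorizer + TfidfTransformer pipeline (training data names are placeholders, and the estimator settings are illustrative), might look like this:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

bagging_models = {
    'Bagging': BaggingClassifier(n_estimators=25),
    # `base_estimator` is renamed `estimator` in newer scikit-learn versions.
    'Bagging + RandomForest': BaggingClassifier(
        base_estimator=RandomForestClassifier(n_estimators=100), n_estimators=10),
    'DecisionTree': DecisionTreeClassifier(criterion='entropy'),
    'ExtraTrees': ExtraTreesClassifier(n_estimators=100),
    'RandomForest': RandomForestClassifier(n_estimators=100),
}

for name, model in bagging_models.items():
    clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', model)])
    clf.fit(train_tweets, train_moods)
    print(name, clf.score(test_tweets, test_moods))  # mean accuracy on held-out tweets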

Boosting Algorithms

Boosting is one of the methods used for predictive modeling (CART) that produces a final model by combining several models trained on random subsets of data drawn from the original training dataset without replacement.

The algorithm works by combining several sequential learners, where the performance of each weak learner (with a higher misclassification rate) is boosted in the next step by assigning a higher weight to it.

Boosting mainly aims to reduce the bias of the combined model (whereas bagging mainly reduces variance), minimizing the loss function (a library-provided loss such as logistic, or a custom loss) so that the weak learners converge to a more accurate voted model.

AdaBoostClassifier:

AdaBoost, or "Adaptive Boosting", is a boosting technique that derives a strong learner from several weak learners by sequentially re-weighting them.

The classifier fits decision stumps to randomly selected data points from the training dataset, re-weighting misclassified points at each step until a best-fit model with minimal or no error is reached.

The prediction of the ensemble model is given by the sum of the weighted predictions: if the weighted sum is positive the first class is predicted, otherwise the second class.

GradientBoostingClassifier:

One of the best algorithms combining Gradient Descent with Boosting, used by many Kaggle winners.

The objective of this algorithm is to reduce the loss, or error residuals (the difference between the actual and predicted values), by successively adding a weak learner at each step to improve the model performance.

It builds an ensemble of trees by sequentially adding individual trees over successive iterations.

GradientBoosting takes more time to train on large datasets, and this time increases proportionally with the number of features and training instances.

XGBoostClassifier:

The Extreme Gradient Boosting algorithm was developed to exploit the parallel processing capability of multi-core CPUs in terms of training time, speed and size of the training data.

It is scalable, portable, memory efficient and more accurate than Gradient Boosting.

It uses an extra randomization parameter (column subsampling in addition to row subsampling) and more regularized models (L1 and L2) to yield better performance and reduce correlation among trees.

In addition, the number of terminal nodes for each tree may vary and is limited by a maximum number of terminal nodes.

XGBoost uses Newton boosting to converge to the minimum in fewer steps than the Gradient Descent used in GradientBoosting.

XGBoost, Source: https://luckytoilet.wordpress.com/tag/xgboost/

LightGBMClassifier:

LightGBM speeds up the training process of the popular GradientBoosting by up to over 20 times while achieving almost the same accuracy.

It is also known to significantly outperform XGBoost in terms of computational speed and memory consumption.

Unlike other decision tree algorithms which grow trees level-wise, LightGBM grows trees leaf-wise by choosing the leaf with the maximum delta loss to grow.

LightGBM uses the concept of Gradient-based One-Side Sampling (GOSS) to attain an accurate estimation of information gain: instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles) are kept, and instances with small gradients are dropped randomly.

It also uses the concept of Exclusive Feature Bundling (EFB), where the feature space is reduced by representing two non-mutually-exclusive features by means of two vertices and a single connecting edge, and bundling features that are not connected by an edge.

LightGBM allows the data to be present on each worker node and lets each worker node decide the local best split point {feature, threshold} on its local feature set.

Further, it lets the workers communicate their local best splits with each other to obtain the best split point for the decision tree.

LightGBM, Source: https://media.readthedocs.org/pdf/lightgbm/latest/lightgbm.pdf

Precision-Recall Curves for Boosting

The precision-recall curves for the different classifiers (for the different predicted moods: anger, arousal, dominance, faith, fear, joy, neutral, sadness) are plotted below by computing the average precision and recall of the selected classes.

The precision-recall plot uses recall on the x-axis and precision on the y-axis, and the point of intersection (as given by the metric in the legend) gives the computed value of the precision-recall metric for the overall class.

AdaBoost Classifier : BJP Sentiment prediction
AdaBoost Classifier : Congress Sentiment prediction
GradientBoost Classifier : BJP Sentiment prediction
GradientBoost Classifier : Congress Sentiment prediction
Light GradientBoost Classifier : BJP Sentiment prediction
Light GradientBoost Classifier : Congress Sentiment prediction
XgBoost Classifier : BJP Sentiment prediction
XgBoost Classifier : Congress Sentiment prediction

Boosting Analysis Source code with CountVectorizer, TF-IDF (Term Frequency–Inverse Document Frequency), and both CountVectorizer and TF-IDF in a Pipeline

Interpretation of Boosting Results

CountVectorizer, and the Pipeline method using CountVectorizer and TfidfTransformer, perform better than TfidfTransformer alone.

GradientBoosting has the highest accuracy of 57.3%, compared to XgBoost with 49.8%, for Congress sentiment prediction.

GradientBoosting has the highest accuracy of 54%, compared to XgBoost with 46%, for BJP sentiment prediction.

AdaBoost performs the worst with an average of 40% accuracy, whereas LightGBM performs slightly better at 45% accuracy for Congress sentiment prediction.

AdaBoost and LightGBM both perform almost the same with 41% accuracy for BJP sentiment prediction.

Moods labelled Dominance, Arousal and Faith work best for Congress in the predicted models, with F1 scores of 0.6, 0.6 and 0.7 respectively in the GradientBoosting Classifier.

Moods labelled Faith, Dominance, Anger and Joy work best for BJP in the predicted models, with F1 scores of 0.7, 0.6, 0.5 and 0.5 respectively in the GradientBoosting Classifier.

Less prominent labelled moods like Fear, Neutral and Sadness have very low precision, recall and F1 scores, as they have very few instances.
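Similarly, here is a hedged sketch of the four boosting models compared here, wrapped in the same pipeline (this assumes the third-party xgboost and lightgbm packages are installed; depending on the library versions, the string mood labels may need to be integer-encoded first, and the data variable names are placeholders):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier    # third-party package
from lightgbm import LGBMClassifier  # third-party package

boosting_models = {
    'AdaBoost': AdaBoostClassifier(n_estimators=100),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(n_estimators=100),
    'LightGBM': LGBMClassifier(n_estimators=100),
}

for name, model in boosting_models.items():
    clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', model)])
    clf.fit(train_tweets, train_moods)
    print(name, clf.score(test_tweets, test_moods))  # mean accuracy on held-out tweets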

Stacking

Stacking is one of the methods used for predictive modeling that produces a final model by combining several models with a meta-learner.

The meta-learner uses the inputs and outputs of each individual model, with different optimized weights, to yield an ensemble prediction.

Precision-Recall Curves for Stacking

The precision-recall curves for the different classifiers (for the different predicted moods: anger, arousal, dominance, faith, fear, joy, neutral, sadness) are plotted below by computing the average precision and recall of the selected classes.

The precision-recall plot uses recall on the x-axis and precision on the y-axis, and the point of intersection (as given by the metric in the legend) gives the computed value of the precision-recall metric for the overall class.

BJP Sentiment prediction : Voting Classifier with Logistic Regression, RandomForest, DecisionTree and Stochastic Gradient Descent with weights = [1,2,2,1]
Congress Sentiment prediction : Voting Classifier with Logistic Regression, RandomForest, DecisionTree and Stochastic Gradient Descent with weights = [1,2,2,1]

Interpretation of Stacking Results

CountVectorizer, and the Pipeline method using CountVectorizer and TfidfTransformer, perform better than TfidfTransformer alone.

BJP yields accuracy, while Congress yields 49% accuracy for sentiment prediction.

Moods labelled Faith, Dominance, Anger, Arousal and Fear work best for BJP in the predicted models, with F1 scores of 0.6, 0.6, 0.6, 0.5 and 0.5 respectively with the Voting Classifier.

Moods labelled Dominance, Faith, Arousal, Joy and Sadness work best for Congress in the predicted models, with F1 scores of 0.7, 0.5, 0.5, 0.5 and 0.5 respectively with the Voting Classifier.

Less prominent labelled moods like Neutral and Sadness have very low precision, recall and F1 scores, as they have very few instances.
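A minimal sketch of the voting ensemble described above (Logistic Regression, RandomForest, DecisionTree and SGD with weights [1, 2, 2, 1], combined here by hard voting; the data variable names are placeholders):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=100)),
                ('dt', DecisionTreeClassifier()),
                ('sgd', SGDClassifier())],
    voting='hard',            # weighted majority vote over the predicted moods
    weights=[1, 2, 2, 1])

stack_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('vote', voting_clf)])
stack_clf.fit(train_tweets, train_moods)
print(stack_clf.score(test_tweets, test_moods))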

Problems and Improvements with Text-Based Classification Techniques

Text-based classification with CountVectorizer and TfidfVectorizer suffers from scalability and other issues, as these vectorizers map unicode string feature names to integer feature indices.

Memory usage of the text vectorizer: the CountVectorizer and TfidfVectorizer classes use strings to represent the features, and the whole vocabulary gets loaded into memory.

Parallelization problems for text feature extraction: the size of the tweet vocabulary_ grows logarithmically with the size of the training corpus as new tweets become available.

Because the vocabularies (built from unique tweets) share words across partitions, building them in parallel on a distributed cluster requires some kind of shared data structure or synchronization barrier.

Impossibility of online or out-of-core/streaming learning: for text-based classifiers, the vocabulary and its size need to be learned from the data and cannot be known before making one pass over the full dataset.

Online learning and prediction: for an incoming, effectively infinite Twitter stream, predicting the live sentiments of people requires a machine learning algorithm that supports incremental learning (the partial_fit method in scikit-learn).

The tweet source can then be trained with an online machine learning algorithm using the hashing vectorizer.

The run-time predictive performance of the model should be evaluated on a separate, non-overlapping validation set.

Scaling text-based classification: HashingVectorizer, a stateless alternative to TfidfVectorizer, can be used to extract features from independent partitions of the tweet data in parallel or distributed processes.

Each partition of extracted features can then be fed to an independent instance of a linear classifier on each computing node, and the results of the independent classifiers can be averaged.
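A hedged sketch of such an out-of-core setup (tweet_stream is a placeholder for any iterator over mini-batches of tweets and their mood labels, not a name from the original code):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

moods = ['anger', 'arousal', 'dominance', 'faith', 'fear', 'joy', 'neutral', 'sadness']

# Stateless vectorizer: no vocabulary to fit, so it can hash tweets from a live stream.
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
online_clf = SGDClassifier(loss='log_loss')  # use loss='log' on older scikit-learn versions

for tweet_batch, mood_batch in tweet_stream:
    X_batch = vectorizer.transform(tweet_batch)
    online_clf.partial_fit(X_batch, mood_batch, classes=moods)  # incremental learning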

Precision-Recall: from the definitions of precision and recall we know that high precision yields few false positives, whereas high recall yields few false negatives.

While going through the curves, we see that for some of the classes the line is barely visible because of very low precision and recall.

For most of the classes we observe higher precision than recall, signifying increased certainty in our predictions, except for certain moods like "Dominance" where recall exceeds precision.

Both precision and recall can be further improved by building an extensive word corpus specific to the election context and by up-sampling to capture more minority-mood tweets.

Accuracy: Boosting, Bagging and Stacking perform much worse than FastText and the other NLTK-based classifiers like Naive Bayes, Decision Trees, SVM and SVC discussed in the previous posts.

Performance of the above models can be further improved by choosing the right parameters through hyper-parameter tuning.

Analyzing the accuracy of every tuned model is beyond the scope of this post, but the snippet below shows how a randomized hyper-parameter search can be set up for the bagging model.

# Randomized hyper-parameter search over a bagging pipeline (imports added for clarity;
# `parameters`, `xtrain_tfidf` and `training_set_y` are defined in the full source in the repository).
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

folds = 5
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1001)
rf_clf = RandomForestClassifier(n_estimators=500, max_features=0.25,
                                criterion="entropy", class_weight="balanced")
random_search = RandomizedSearchCV(
    BaggingClassifier(base_estimator=rf_clf, n_estimators=25, max_features=0.25),
    param_distributions=parameters,
    cv=skf.split(xtrain_tfidf, training_set_y),
    n_iter=10)
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('rfrandomsearch', random_search)])

References

https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c
https://www.cs.cmu.edu/~jgc/publication/A_New_Pairwise_Ensemble_Approach_ICML_2003.pdf
https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
https://medium.com/analytics-vidhya/twitter-sentiment-analysis-for-the-2019-election-8f7d52af1887
https://medium.com/analytics-vidhya/sentiment-classification-for-2019-elections-using-text-based-classifiers-217f86b05124
https://medium.com/analytics-vidhya/elections-2019-mood-classification-with-text-based-classifiers-ii-bf23c3dfac7f

Please let me know if there are any mistakes; suggestions and feedback are welcome.

The election repository is available at https://github.com/sharmi1206/elections-2019.

Please feel free to follow me on LinkedIn.

Disclaimer Statement

This work analyses tweets about two prominent parties for the upcoming election.

The author has no intention of creating controversy in people's minds, hurting anybody's feelings, or inciting feelings of anger or hatred.

It is done purely for academic, research and informational purposes, and someone else might get different results by applying other techniques of analysis.

It is an unbiased and impartial summary and does not discriminate against or differentiate any individual or group.
