Machine Learning — Multiclass Classification with Imbalanced Dataset

Multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear, but not both at the same time.

Imbalanced Dataset

Imbalanced data typically refers to classification problems where the classes are not represented equally. For example, you may have a 3-class classification problem in which a set of 100 fruits must be classified as oranges, apples or pears. A total of 80 instances are labeled Class-1 (Oranges), 10 instances are labeled Class-2 (Apples) and the remaining 10 instances are labeled Class-3 (Pears). This is an imbalanced dataset with a class ratio of 8:1:1.

Most classification data sets do not have exactly the same number of instances in each class, but a small difference often does not matter. There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced: the vast majority of the transactions will be in the "Not-Fraud" class and a very small minority will be in the "Fraud" class.

Dataset

The data set we will be using for this example is the famous "20 Newsgroups" data set. It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

scikit-learn provides tools to load and pre-process the dataset; see the scikit-learn documentation for more details. The number of articles in each newsgroup is roughly uniform. To make the overall dataset imbalanced, we remove some news articles from some of the groups. Our imbalanced dataset with 20 classes is now ready for further analysis.

Build Model

As this is a classification problem, we will use an approach similar to the one described in my previous article on sentiment analysis. The only difference is that here we are dealing with a multiclass classification problem. The last layer in the model is Dense(num_labels, activation='softmax') with num_labels=20 classes. 'softmax' is used instead of 'sigmoid' because each sample belongs to exactly one class, so the output should be a single probability distribution over all 20 classes.
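The two ideas above, removing samples from some classes to create an imbalance and using softmax for a single-label output, can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used in the article: the helper names make_imbalanced and softmax, the toy fruit labels, and the example scores are all hypothetical, and the article's actual pipeline works on the 20 Newsgroups data with a Keras model.

```python
import math
import random
from collections import Counter


def make_imbalanced(labels, keep_per_class, seed=0):
    """Return sorted indices of a subsample that keeps at most
    keep_per_class[label] samples for each listed label.
    Labels not listed in keep_per_class are kept in full."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    kept = []
    for label, indices in by_class.items():
        limit = keep_per_class.get(label, len(indices))
        kept.extend(rng.sample(indices, min(limit, len(indices))))
    return sorted(kept)


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical balanced toy labels: 100 samples for each of three fruits.
labels = ["orange"] * 100 + ["apple"] * 100 + ["pear"] * 100

# Keep 80 oranges, 10 apples and 10 pears to get the 8:1:1 ratio.
kept = make_imbalanced(labels, {"orange": 80, "apple": 10, "pear": 10})
counts = Counter(labels[i] for i in kept)
print(counts)

# Hypothetical raw scores for three classes; the probabilities sum to 1,
# which is what a single-label (multiclass) output layer needs.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

A sigmoid output would score each class independently, so the values would not have to sum to 1. That suits multi-label problems, whereas softmax forces the classes to compete for a single label.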
