Exploratory Data Analysis, Feature Engineering and Modelling using Supermarket Sales Data. Part 1.

etc. You get the hang of it. I could also create features from the existing ones by doing what we call feature crossing (more on this in the next post).

Now that we’re familiar with the terms EDA and FE, let’s get our data and start exploring. We’ll be using a dataset of supermarket sales provided by Data Science Nigeria. Here’s a link to the data. We’ll use the popular prototyping tool Jupyter Notebook and, sorry “R” folks, I’ll be using Python for this exploration.

After downloading your data, place it in the same folder as your notebook so you can access it. Start your notebook and import the following libraries as shown below.

Here, we import numpy, pandas and matplotlib, which are used for manipulation of arrays, processing of CSV files and plot visualization respectively.

Here, we import the stats module, which contains some statistical functions such as norm and skew that we’ll use for calculating some statistics.

We import seaborn, a powerful plotting library built on top of matplotlib. We’ll use this for creating some insightful plots.

Python may throw some annoying warnings; we stop this using the ignore_warns function.

We import os, a package for accessing files and folders easily. We use os.listdir to show the content of our current directory (present working directory).

Next, let’s read in our data.

We read in our data using the pandas read_csv() command.

We print out the first five rows of the data.

Note: I transposed the train.head() command because I wanted to see all rows and columns on a single page without scrolling horizontally.

Now that we can see our data and the features it contains, let’s get down to business.

(Image source: pixabay.com)

First, we need to know our target variable. We could pick any feature to be our target, and that will in turn tell us the kind of model we’ll build. For example, if we pick Product_Fat_Content to be our target variable, then we’ll have a classification problem, and if we decide to pick Product_Price, then it becomes a regression problem. But since we’re not concerned with building models in this post, we’ll just assume we’re trying to predict Product_Supermarket_Sales (total number of sales made by a supermarket).

Next, let’s get to know our features. Since the number of features is small, we can manually look at them and instantly remove the ones that aren’t worth exploring or putting into our model. We’ll use our domain knowledge of a supermarket for this.

I usually take out a pen and paper, draw five columns as shown below and manually fill them in for all features:

Feature == Important == In-between == Not-important == Reason

You could do this any other way though. Let’s look at these features.

Product_Identifier: This is a unique ID for each particular product.
Verdict: In-between.
Reason: Sometimes it is best to remove any unique ID columns because our model can overfit to them, but sometimes they may help. We can experiment with this.

Supermarket_Identifier: This is a unique ID for each supermarket.
Verdict: In-between.
Reason: Same as above.

Product_Supermarket_Identifier: This is a concatenation of the product and supermarket identifier.
Verdict: In-between.
Reason: This feature can be used in place of the product and supermarket identifiers.
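The loading steps described earlier (listing the working directory, reading the CSV with read_csv() and transposing head()) can be sketched roughly as follows. This is a minimal sketch, not the original notebook code: the file name train.csv and the helper preview_csv are assumptions made for illustration, and only os and pandas are imported here.

```python
import os

import pandas as pd


def preview_csv(path, n_rows=5):
    """Read a CSV file and return (full frame, first n_rows transposed).

    Transposing head() puts each feature on its own row, so wide
    tables can be read without scrolling horizontally.
    """
    df = pd.read_csv(path)
    return df, df.head(n_rows).T


if __name__ == "__main__":
    # Show what is in the present working directory.
    print(os.listdir("."))

    # "train.csv" is a hypothetical file name; use whatever the
    # downloaded file is actually called.
    data_file = "train.csv"
    if os.path.exists(data_file):
        train, preview = preview_csv(data_file)
        print(preview)
```

Keeping the read and the preview in one small function makes it easy to repeat the same check on any other file from the dataset.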
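To make the point about the target variable concrete, here is a small sketch of how the choice of target suggests the problem type. It is only a heuristic under stated assumptions: the helper infer_problem_type is hypothetical, the toy values are made up and are not taken from the dataset, and a numeric column that actually encodes classes would be mislabelled as regression.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def infer_problem_type(target):
    """Guess the problem type from the dtype of the target column.

    Numeric (non-boolean) targets suggest regression; anything else
    (text labels, categories, booleans) suggests classification.
    """
    if is_numeric_dtype(target) and not is_bool_dtype(target):
        return "regression"
    return "classification"


# Toy values invented for illustration only.
toy = pd.DataFrame(
    {
        "Product_Fat_Content": ["Low Fat", "Normal Fat", "Low Fat"],
        "Product_Price": [357.54, 355.79, 365.79],
    }
)

if __name__ == "__main__":
    for column in toy.columns:
        print(column, "->", infer_problem_type(toy[column]))
```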
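One way to experiment with the identifier columns is to check how ID-like each column is and whether the combined identifier really is just the other two joined together. The sketch below assumes a "_" separator and uses made-up toy rows; the helpers id_like_columns and is_concatenation are hypothetical names, and the real data may join the identifiers differently.

```python
import pandas as pd


def id_like_columns(df, threshold=0.9):
    """Return columns whose share of distinct values is >= threshold."""
    ratios = df.nunique() / len(df)
    return ratios[ratios >= threshold].index.tolist()


def is_concatenation(df, left, right, combined, sep="_"):
    """Check whether `combined` equals `left` + sep + `right` on every row.

    The separator is an assumption; adjust it to match the real data.
    """
    expected = df[left].astype(str) + sep + df[right].astype(str)
    return bool((expected == df[combined].astype(str)).all())


# Toy rows invented for illustration only.
toy_ids = pd.DataFrame(
    {
        "Product_Identifier": ["P1", "P2", "P1", "P2"],
        "Supermarket_Identifier": ["S1", "S1", "S2", "S2"],
        "Product_Supermarket_Identifier": ["P1_S1", "P2_S1", "P1_S2", "P2_S2"],
        "Product_Price": [10.0, 12.5, 10.0, 12.5],
    }
)

if __name__ == "__main__":
    print(id_like_columns(toy_ids))
    print(
        is_concatenation(
            toy_ids,
            "Product_Identifier",
            "Supermarket_Identifier",
            "Product_Supermarket_Identifier",
        )
    )
```

A column where nearly every value is distinct carries little that a model can generalise from, which is why such columns are candidates for removal or for experiments with and without them.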
