How I Improved Accuracy Of My Machine Learning Project?
Follow these tips to get better results
Farhad Malik · May 6
Working on a machine learning project can be a tedious task, in particular when you have gathered all of the available data and yet the model yields poor results.
This article provides tips that you can follow to improve the accuracy of your machine learning model.
1. Always Target Data First
This section aims to provide data processing techniques that can be followed to produce a good quality training set.
Building A Good Quality Training Data Set Is The Most Important Phase Of Predictive Analysis. Occasionally, It Is Also The Most Time Consuming Part
Common Data Quality Issues
The following issues are usually encountered when preparing data for your machine learning model:
- There might be missing or erroneous values in the data set.
- There might be categorical (textual, Boolean) values in the data set, and not all algorithms work well with textual values.
- Some features might have larger values than others and need to be transformed so that they carry equal importance.
- Sometimes the data contains a large number of dimensions and the number of dimensions needs to be reduced.
Techniques To Improve Data Quality
Use case 1: Filling Missing Values
Let's assume we want to forecast a variable, e.g. Company Sales, and that it depends on the following two variables: Share Price and Total Employees of the company. Both Share Price and Total Employees contain numerical values. Let's also assume that the data for Share Price and Total Employees over a range of dates is stored in different csv files.
Scenario: Once we join the two data sets using the pandas DataFrame merge() method, we might see empty values, or placeholder strings such as NaN indicating that a number is missing.
Issue: Most models are unable to fit and predict values when we feed them missing values.
Solution: The pandas DataFrame provides a number of features to replace missing values.
Step 1: Place the data into a pandas DataFrame:
import pandas as pd
data_frame = pd.read_csv(my_data)
Step 2: One option is to remove columns/rows with empty values, however I do not recommend this approach:
# remove all rows with missing data
data_frame = data_frame.dropna()
# remove rows with missing data in specific columns
data_frame = data_frame.dropna(subset=['Total Employees'])
Gathering clean data is a time consuming task, and removing columns (features) or rows can end up losing important information from the data set.
Better option: Replace missing values by setting a default value to replace NaN, back or front filling the data set, interpolating or extrapolating the values, etc. We can also train a model on the training data set so that it can return appropriate values to fill in the missing values.
One appropriate strategy is to impute values using the scikit-learn Imputer.
As an example, we can do this:
data_frame = pd.read_csv(my_data)
# set a default value to replace NaN (here 0 as an example)
data_frame = data_frame.fillna(0)
# back fill
data_frame = data_frame.fillna(method='bfill')
# front fill
data_frame = data_frame.fillna(method='ffill')
# impute mean values using scikit-learn (SimpleImputer in recent versions)
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(data_frame)
imputed_data = imr.transform(data_frame.values)
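As an illustration of the model-based approach mentioned above, a nearest-neighbour imputer can learn fill values from the other rows. This is only a minimal sketch, assuming a recent scikit-learn version (0.22+) and the hypothetical Share Price / Total Employees columns from the example:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# hypothetical data set from the example above, with one missing Share Price
data_frame = pd.DataFrame({
    'Share Price': [10.5, np.nan, 11.2, 10.9],
    'Total Employees': [500, 510, 515, 520]
})
# each missing value is estimated from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed_values = imputer.fit_transform(data_frame)
data_frame = pd.DataFrame(imputed_values, columns=data_frame.columns)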
Once we have replaced missing values, we now need to see if we have any categorical values in our data set.
If you are looking for an introduction to Python, then please read my article "Python From Scratch" on Medium.
Use case 2: Handling Categorical Values
Let's assume we want to forecast a variable, e.g. Number Of Tweets, and that it depends on the following two variables: Most Active Current News Type and Number Of Active Users. In this instance, Most Active Current News Type is a categorical feature. It can contain textual data such as "Fashion", "Economical", etc. Additionally, Number Of Active Users contains numerical values.
Scenario: Before we feed the data set into our machine learning model, we need to transform categorical values into numerical values because many models do not work with textual values.
Solution: There are a number of strategies to handle categorical features:
1. Create a dictionary to map categorical values to numerical values. A dictionary is a data storage structure that contains a list of key-value pairs and enables a key to be mapped to a value.
# this will map categorical values to numerical values
mapping = {'Fashion': 1, 'Economical': 2}
target_feature = 'Most Active Current News Type'
data_frame[target_feature] = data_frame[target_feature].map(mapping)
This strategy works well for ordinal values too. Ordinal values are textual values that can be ordered, such as Clothes Size (Small, Medium, Large, etc.), as shown in the sketch below.
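For example, a minimal sketch for a hypothetical ordinal Clothes Size column (the column name and the chosen order are assumptions for illustration):
# the numbers encode the natural ranking Small < Medium < Large
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
data_frame['Clothes Size'] = data_frame['Clothes Size'].map(size_mapping)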
2. Another strategy is to use encoders to assign a unique numerical value to each textual value. This strategy works better for a variable with a large number of distinct values (>30), such as an Organisational Job Hierarchy. We could use manual or scikit-learn encoders.
2.1 Manual Encoders
import numpy as np
target_feature = 'Most Active Current News Type'
# get unique values
unique_values = np.unique(data_frame[target_feature])
# build a map from each textual value to a numerical index
mapping = {textual_value: index for index, textual_value in enumerate(unique_values)}
# apply the map: this will convert categorical values to numerical values
data_frame[target_feature] = data_frame[target_feature].map(mapping)
2.2 Sci-Kit Learn Encoders
from sklearn.preprocessing import LabelEncoder
target_feature = 'Most Active Current News Type'
# use the encoder to fit and transform
encoder = LabelEncoder()
encoded_values = encoder.fit_transform(data_frame[target_feature].values)
data_frame[target_feature] = pd.Series(encoded_values, index=data_frame.index)
# to reverse the encoding, use the inverse_transform method
decoded = encoder.inverse_transform(data_frame[target_feature].values)
data_frame[target_feature] = pd.Series(decoded, index=data_frame.index)
One more step which is often missed out
I have often seen this scenario: after the textual values are encoded to numerical values, we will see some values which are greater than others.
The model may interpret higher values as having higher importance, which can lead to it treating features differently. For instance, Fashion news type might get a value of 1 and Economical news type might get a value of 10. This makes the machine learning model assume that Economical news type is more important than Fashion news type.
Solution: We can solve this by using One-Hot Encoding.
One Hot Encoding
To prevent some categorical values from getting higher importance than others, we can use the one hot encoding technique before we feed encoded data into our machine learning model. The one hot encoding technique essentially creates a replica (dummy) feature for each distinct value in our target categorical feature. Once the dummy features are created, a boolean (0 or 1) is populated to indicate whether the value is true or false for that feature. As a consequence, we end up with a wide sparse matrix populated with 0/1 values. For instance, if your feature has values "A", "B" and "C", then three new features (columns) will be created: Feature A, Feature B and Feature C. If the first row's feature value was A, then Feature A will be 1 and Features B and C will be 0, and so on.
Solution: We can use the pandas get_dummies() method, which converts categorical columns into one-hot encoded dummy columns:
data_frame = pd.get_dummies(data_frame)
Additionally, we could use sklearn.preprocessing.OneHotEncoder, as in the sketch below.
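A minimal sketch of the scikit-learn route, assuming scikit-learn 0.20+ (where OneHotEncoder accepts string categories) and reusing the data_frame and target column from the example above:
from sklearn.preprocessing import OneHotEncoder

target_feature = 'Most Active Current News Type'
encoder = OneHotEncoder()
# fit_transform expects a 2D input, hence the double brackets;
# the result is a sparse 0/1 matrix with one column per distinct value
one_hot_matrix = encoder.fit_transform(data_frame[[target_feature]])
one_hot_dense = one_hot_matrix.toarray()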
Tip: Always One Hot Encode After Encoding Textual Values To Prevent Ordering
Use case 3: Scaling Features
Now all missing values are populated and categorical values have been transformed into numerical values.
Usually, when we have multiple features in our data set, we need to ensure that the values are scaled properly.
The range of values in a feature should reflect its importance: higher values imply higher importance.
Scenario: Let's assume we want to predict a closing stock price. We want to use a simple best-fit-line regression model that uses the GBP to EUR exchange rate and the number of employees of a company to predict the stock's price. Therefore, we gather a data set that contains the GBP to EUR exchange rate and the number of employees of the company over time. Exchange rates will range from 0 to 1, whereas the number of employees will have far bigger values, possibly in the 1000s. Consequently, the model will give the number of employees higher precedence than the exchange rate.
There are two common ways to scale the features:
Normalisation: Ensure all values range between 0 and 1. It can be done by applying the following formula:
Normalised Value = (Value - Feature Min) / (Feature Max - Feature Min)
sklearn.preprocessing.MinMaxScaler can be used to perform normalisation.
Standardisation: Ensure the values in a feature are rescaled so that their mean is 0 and their standard deviation is 1:
Standardised Value = (Value - Feature Mean) / Feature Standard Deviation
sklearn.preprocessing.StandardScaler can be used to perform standardisation.
Standardisation is superior to normalisation in most scenarios because it preserves information about outliers rather than squashing every value into a fixed range, and it keeps all features on a comparable, well-behaved scale. This makes it easier for many models to learn their weights, which again helps predictive models.
Key: Train Scalers On The Training Set Only. Do Not Use All Of The Data
Whenever we train our models, and even when we fit imputers or scalers, we should always fit them on the training set only. Leave the test and validation sets for testing only, as the sketch below illustrates.
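Here is a minimal sketch of this idea, assuming X_train and X_test have already been created by a train/test split (the variable names are illustrative):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit the scaler on the training data only
X_train_scaled = scaler.fit_transform(X_train)
# reuse the same fitted scaler to transform the test data
X_test_scaled = scaler.transform(X_test)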
Use case 4: Remove Existing Features
Let's assume you train your machine learning model on a training set and you use a measure, such as Adjusted R Squared, to assess the quality of your model. Your model's Adjusted R Squared is 90%+, implying that it explains over 90% of the variance in the training data.
Scenario: When you feed your test data into your model, you get a very low Adjusted R Squared score, implying that the model is not accurate on unseen data and is over-fitting the training data.
This is a classic case of over-fitting.
Some features are just not as important as we first conclude from the training set, and including them can end up over-fitting our machine learning model.
Solutions: There are several methods to prevent over-fitting, such as adding more data and/or eliminating features. I have outlined some solutions in my article "Supervised Machine Learning: Regression Vs Classification" on Medium.
1. We can remove features that are strongly correlated with each other. You can use a correlation matrix to determine the correlation between all independent variables, and a scatter matrix plot to see how the variables relate to each other, as sketched below.
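A minimal sketch of both diagnostics, assuming the features are numeric columns of data_frame (matplotlib is only needed to display the plot):
import matplotlib.pyplot as plt
import pandas as pd

# pairwise correlation between all numeric columns
correlation_matrix = data_frame.corr()
print(correlation_matrix)

# scatter matrix plot of how the variables relate to each other
pd.plotting.scatter_matrix(data_frame, figsize=(8, 8))
plt.show()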
2. We can use a RandomForestClassifier, which can give us the importance of each feature:
from sklearn.ensemble import RandomForestClassifier
my_importance_model = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
my_importance_model.fit(independent_variables, dependent_variables)
print(my_importance_model.feature_importances_)
The least important features can then be excluded.
Use case 5: Creating New Features From Existing Features
Occasionally we want to create a new feature out of one or more existing features. Sometimes we can also create a new feature out of the dependent variable, the variable which we want to predict. As an example, in time series predictive analysis, we can extract the trend and seasonality from the data and then feed Trend and Seasonality as separate features to forecast our target variable, as in the sketch below.
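A minimal sketch of this idea using statsmodels' seasonal_decompose; the Company Sales series, the monthly period of 12 and the argument names (which follow recent statsmodels versions) are assumptions for illustration:
from statsmodels.tsa.seasonal import seasonal_decompose

# hypothetical monthly sales series from the earlier example
sales = data_frame['Company Sales']
decomposition = seasonal_decompose(sales, model='additive', period=12)

# trend and seasonality become separate candidate features
data_frame['Trend'] = decomposition.trend
data_frame['Seasonality'] = decomposition.seasonal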
Time series analysis is a complex topic; I have covered it in detail in my articles "How Do I Predict Time Series?" and "How Good Is My Predictive Model — Regression Analysis" on Medium.
Use case 6: Reducing Dimensions
Scenario: Occasionally we want to reduce the number of dimensions.
An example is when we scrape websites and convert textual data into vectors using word-to-vector encoding algorithms. We can end up with a sparse matrix.
Issue: A sparse, high-dimensional matrix can slow down the algorithms.
Solution: Decompose the matrix while ensuring valuable information is not lost. We can use Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) or Kernel PCA to reduce the dimensions, as in the sketch below.
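A minimal PCA sketch with scikit-learn, assuming high_dim_features is a dense, scaled feature matrix (hypothetical name) with well over 50 columns; the choice of 50 components is illustrative only:
from sklearn.decomposition import PCA

# project the high-dimensional features onto 50 principal components
pca = PCA(n_components=50)
reduced_features = pca.fit_transform(high_dim_features)
# explained_variance_ratio_ shows how much information the components retain
print(pca.explained_variance_ratio_.sum())
For a genuinely sparse matrix, scikit-learn's TruncatedSVD performs a similar decomposition without densifying the data.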
2. Now Fine Tune The Model Parameters
Fine tuning a machine learning predictive model is a crucial step to improve the accuracy of the forecasted results. In the recent past, I have written a number of articles that explain how machine learning works and how to enrich and decompose the feature set to improve the accuracy of your machine learning models.
This section covers:
- Retrieving estimates of a model's performance using scoring metrics
- Finding and diagnosing the common problems of machine learning algorithms
- Fine-tuning the parameters of machine learning models
Step 1: Understand What Tuning A Machine Learning Model Is
Sometimes, we have to explore how model parameters can enhance the forecasting accuracy of our machine learning model.
Fine tuning a machine learning model is a black art. It can turn out to be an exhaustive task. I will be covering a number of methodologies in this article that we can follow to get accurate results in a shorter time. I am often asked about the techniques that can be utilised to tune forecasting models once the features are stable and the feature set is decomposed. Once everything else has been tried, we should look to tune our machine learning models.
Tuning A Machine Learning Model Is Like Rotating TV Switches And Knobs Until You Get A Clearer Signal
- X Train: training data of the independent variables, also known as features
- X Test: test data of the independent variables
- Y Train: training data of the dependent variable
- Y Test: test data of the dependent variable
For instance, if you are forecasting the volume of a waterfall based on temperature and humidity, then the volume of water is represented as Y (the dependent variable) and Temperature and Humidity are the X (independent variables or features). The training data of X is then known as X Train, which you can use to train your model. Hyperparameters are parameters of the models that can be passed as arguments to the models, as in the sketch below.
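For example, here is a minimal sketch showing hyperparameters being passed as constructor arguments, assuming X_train and Y_train hold the training data described above (the specific values are illustrative only):
from sklearn.ensemble import RandomForestClassifier

# n_estimators and max_depth are hyperparameters of the model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, Y_train)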
Step 2: Cover The Basics
Before you fine tune your forecasting model, it is important to briefly understand what machine learning is. If you are new to machine learning, then please have a look at my article "Machine Learning In 8 Minutes" on Medium.
It is often easier to improve the data that we feed into the models than to fine tune the parameters of the model. If you want to improve the accuracy of your forecasting model, then please enrich the data in the feature set first. If you feed poor quality data in, then the model will yield poor results. Please have a look at my article "Processing Data To Improve Machine Learning Models Accuracy", which highlights common techniques we can use to enrich the features. If you are unsure whether your model is the most appropriate one for the problem, then have a look at my article "Machine Learning Algorithms Comparison", which reviews the most common machine learning algorithms.
Step 3: Find Your Score Metric
The most important pre-requisite is to decide on the metric that you are going to use to score the accuracy of the forecasting model. It could be R squared, Adjusted R squared, Confusion Matrix, F1, Recall, Variance, etc. These measures are explained in an easy-to-understand manner in my article "Must Know Mathematical Measures For Every Data Scientist" on Medium.
The sklearn.metrics module contains a large number of scoring metrics, as the sketch below shows.
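A minimal sketch of scoring with sklearn.metrics, assuming y_test holds the true values and predictions holds the model's output (both names are illustrative):
from sklearn.metrics import r2_score, mean_squared_error

# regression example: compare predictions against the unseen test values
print('R squared:', r2_score(y_test, predictions))
print('Mean squared error:', mean_squared_error(y_test, predictions))
# classification problems would use metrics such as f1_score or confusion_matrix instead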
Step 4: Obtain An Accurate Forecasting Score
Once you have prepared your training set, enriched its features, scaled the data, decomposed the feature set, decided on the scoring metric and trained your model on the training data, then you should test the accuracy of the model on unseen data. The unseen data is known as "test data". You can utilise cross validation to assess how your model performs on unseen data; the resulting error is known as the generalisation error of your model.
Cross Validation
There are two common cross validation methodologies.
Holdout Cross Validation
It is not wise machine learning practice to train your model and score its accuracy on the same data set. It is a far superior technique to test your model, with varying model parameter values, on an unseen test set. It is good practice to divide your data set into three parts:
- Training Set
- Validation Set
- Test Set
Train your model on the training set (60% of the data), then perform model selection (tuning parameters) on the validation set (20% of the data) and, once you are ready, test your model on the test set (20% of the data). Choose the proportions of your training, validation and test sets according to the needs of your machine learning model and the availability of data. A minimal splitting sketch follows.
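Here is a minimal sketch of the 60/20/20 split using scikit-learn's train_test_split, applied twice since it only splits into two parts at a time; X and y are assumed to be the feature matrix and target:
from sklearn.model_selection import train_test_split

# first carve out 20% as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# then split the remaining 80% into 60% training and 20% validation (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)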
K Fold Cross Validation
K fold cross validation is a superior mechanism to holdout cross validation. The way it works is that the data is divided into k folds (parts). k-1 folds are used to train the model and the remaining fold is used to test it. This mechanism is then repeated k times, so that each fold is used for testing once. Each time, a number of performance metrics can be used to assess and score the performance, and the average of the performance metrics is then reported. The class proportions are preserved in StratifiedKFold.
Choose between 8-12 folds.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=Y_train, cv=12, n_jobs=-1)
mean_scores = scores.mean()
The n_jobs parameter controls the number of CPUs used to run the cross validation.
Step 5: Diagnose The Best Parameter Values Using Validation Curves
Once accurate forecasting scores have been established, find out all of the parameters that your model requires. You can then use validation curves to explore how their values can improve the accuracy of the forecasting model. Before we tune the parameters, we need to diagnose whether the model is suffering from underfitting or overfitting. Models that have a large number of parameters tend to overfit. We can use validation curves to diagnose and resolve overfitting and underfitting in machine learning.
The parameters are also known as hyperparameters.
A validation curve is used to pass in a range of values for a model parameter. It changes the value of the model parameter one step at a time, and the accuracy values can then be plotted against the parameter values to assess the accuracy of the model. For example, if your model takes a parameter named "number of trees", then you can test your model by passing in 10 different values of that parameter. The validation curve reports the accuracy for each parameter value. Finally, take the value that returns the highest accuracy and gives you your required results within an acceptable time.
Scikit-learn offers a validation curve module:
from sklearn.model_selection import validation_curve
number_of_trees = [1, 2, 3, 4, 5, 6, 7, 99, 1000]
# <PIPELINE> is your estimator or pipeline; param_name must match the
# estimator's parameter for the number of trees (e.g. n_estimators)
train_scores, test_scores = validation_curve(estimator=<PIPELINE>, X=X_train, y=Y_train, param_name='n_estimators', param_range=number_of_trees)
Step 6: Use Grid Search To Optimise The Hyperparameter Combination
Once we have retrieved optimum values for individual model parameters, we can use grid search to obtain the combination of hyperparameter values that gives us the highest accuracy.
Grid search evaluates all possible combinations of the parameter values. It is exhaustive and uses brute force to find the most accurate combination, and is therefore a computationally intensive task. Use GridSearchCV from scikit-learn to perform a grid search:
from sklearn.model_selection import GridSearchCV
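A minimal sketch of how GridSearchCV can be used; the RandomForestClassifier, the parameter grid and the accuracy scoring are assumptions for illustration, and X_train / Y_train are the training data described earlier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# candidate hyperparameter combinations to evaluate exhaustively
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [3, 5, 10],
}
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=10,
                           n_jobs=-1)
grid_search.fit(X_train, Y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)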
Step 7: Continuously Tune The Parameters To Further Improve Accuracy
The key here is to always enhance the training set as soon as more data is available. Always test your forecasting model on richer test data that the model has not seen before, and always ensure that the right model and parameter values are chosen for the job. It is important to feed in more data as soon as it becomes available and to test the accuracy of the model on a continuous basis so that its performance and accuracy can be further optimised.
Summary
This article provided an overview of the two key steps that can be utilised to enhance the accuracy of your machine learning models. Hope it helps.