Object-oriented programming for data scientists: Build your ML estimator

In spite of being different, fit and predict have the commonality that both can be imagined as essential parts of the overall linear regression process — we expect a linear regression to fit some training data and then be able to predict for future, unseen data.

We also expect the linear regression model to provide some indication of how good the fit was — generally in the form of a single numeric score called the coefficient of determination, or R².

As expected, we see a function score, which returns exactly that R² number, also hanging around fit and predict.

Neat and clean, isn’t it? Data, functions, and parameters cohabiting inside a single logical unit.

How was it made possible? It was possible because we rose above the individual differences, thought about linear regression as a high-level process, and decided what essential actions it should serve and what critical parameters it should inform its users about.

We made a high-level class called LinearRegression under which all those apparently disparate functions can be grouped together for easy book-keeping and enhanced usability.

Once we imported this class from the library, we just had to create an instance of the class — we called it lm.

That’s it.

All the functions, grouped under the class, became accessible to us through that newly defined instance lm.

If we are not satisfied with some of the internal implementation of the functions, we can work on them and re-attach them to the main class after modification.

Only the code of the internal function changes, nothing else.

See how logical and scalable it sounds?

Create your own ML estimator

A traditional introduction to OOP will have plenty of examples using classes such as animals, sports, or geometric shapes.

But for data scientists, why not illustrate the concepts using the example of an object they use every day in their code — a machine learning estimator?

Just like the lm object from the Scikit-learn library that we examined above.

A good, old Linear Regression estimator — with a twist

In this GitHub repo, I have shown, step by step, how to build a simple linear regression (single-variable or multivariate) estimator class following the OOP paradigm.

Yes, it is the good old linear regression class.

It has the usual fit and predict methods as in the LinearRegression class from Scikit-learn.

But it has more functionality.

Here is a sneak peek…

Yes, this estimator is richer than the Scikit-learn estimator in the sense that, in addition to the standard fit, predict, and R² score functions, it has a host of other utilities essential for a linear regression modeling task.

This is especially true for data scientists and statistical modeling folks, who not only want to predict but also would like to measure the goodness of fit, verify the assumptions of linear regression, check for multicollinearity in the data, or detect outliers.

How do you check the quality of your regression model in Python? – Linear regression is rooted strongly in statistical learning and therefore the model must be checked for the ‘goodness… (towardsdatascience.com)

How do you start building the class?

We start with a simple code snippet to define the class.

We name it — MyLinearRegression.

Here, self denotes the object itself and __init__ is a special function which is invoked when an instance of the class is created somewhere in the code.

As the name suggests, __init__ can be used to initialize the class with necessary parameters (if any).

We can add a simple description string to keep it honest. :-)
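A minimal version of this first snippet might look like the following — a sketch consistent with the class definition shown later in this article; the exact code in the repo may differ:

```python
class MyLinearRegression:
    """
    A simple linear regression estimator, built from scratch
    following the OOP paradigm.
    """

    def __init__(self, fit_intercept=True):
        # Coefficients and intercept are unknown until the model is fitted
        self.coef_ = None
        self.intercept_ = None
        self._fit_intercept = fit_intercept
```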

We add the core fit method next. Note the docstring describing the purpose of the method, what it does, and what type of data it expects.

All of these are part of good OOP principles.
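Here is one way fit could be implemented, solving the ordinary least-squares normal equation with NumPy. This is a sketch: the attribute names self.X_ and self.y_ are my choices for this walkthrough, and the repo's version may differ in details.

```python
import numpy as np

# This method goes inside the MyLinearRegression class
def fit(self, X, y):
    """
    Fit model coefficients.

    Arguments:
    X: 1-D or 2-D numpy array of features, shape (n_samples, n_features)
    y: 1-D numpy array of targets, shape (n_samples,)
    """
    # Ensure X is 2-D, even for a single feature
    X = np.array(X)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    y = np.array(y)

    # Keep the training data around for plotting and diagnostics later
    self.X_ = X
    self.y_ = y

    # Prepend a column of ones if we are fitting an intercept
    if self._fit_intercept:
        X_biased = np.c_[np.ones(X.shape[0]), X]
    else:
        X_biased = X

    # Closed-form least-squares solution: beta = (X'X)^(-1) X'y
    betas = np.linalg.inv(X_biased.T @ X_biased) @ X_biased.T @ y

    if self._fit_intercept:
        self.intercept_ = betas[0]
        self.coef_ = betas[1:]
    else:
        self.intercept_ = 0.0
        self.coef_ = betas
    return self
```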

We can generate some random data to test our code so far.

We create a linear function of two variables.
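A minimal way to do that — the exact coefficients and noise level here are my choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# 100 samples, 2 features
X = 10 * np.random.random(size=(100, 2))

# A linear function of the two variables, plus Gaussian noise
y = 3.5 * X[:, 0] - 1.2 * X[:, 1] + 2 + np.random.normal(scale=2.0, size=100)

# Scatter plots of the output against each feature
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for i, ax in enumerate(axes):
    ax.scatter(X[:, i], y)
    ax.set_xlabel(f"X{i + 1}")
    ax.set_ylabel("y")
plt.show()
```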

Here are the scatter plots of the data.

Now, we can create an instance of the class MyLinearRegression called mlr.

What happens if we try to print the regression parameters? Because self.coef_ was set to None in __init__, we get None when trying to print mlr.coef_.

Note how self becomes synonymous with the instance of the class — mlr — once the instance is created.

But the definition of fit includes setting the attributes once the fitting is done.

Therefore, we can just call mlr.fit(X, y) and print out the fitted regression parameters.
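In code, the whole sequence looks like this (the printed values assume the toy data generated above):

```python
mlr = MyLinearRegression()

# Before fitting, the parameters are still None
print(mlr.coef_)       # None
print(mlr.intercept_)  # None

# After fitting, the attributes are populated
mlr.fit(X, y)
print(mlr.coef_)       # roughly [3.5, -1.2] for the toy data above
print(mlr.intercept_)  # roughly 2
```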

The quintessential predict method

After fitting comes prediction.

We can add that method easily to our regression class.
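A sketch of predict, consistent with the fit method sketched above:

```python
import numpy as np

# This method goes inside the MyLinearRegression class
def predict(self, X):
    """
    Output model prediction.

    Arguments:
    X: 1-D or 2-D numpy array of features
    """
    X = np.array(X)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # y_hat = intercept + X @ coefficients
    return self.intercept_ + X @ self.coef_
```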

What if we want to add a plotting utility function (or a few)?

At this point, we start expanding our regression class and adding stuff that is not even present in the standard Scikit-learn class! For example, we always want to see how the fitted values compare to the ground truth.

It is easy to create a function for that.

We will call it plot_fitted.

Note that a method is like a normal function.

It can take additional arguments.

Here, we have an argument reference_line (default set to False) which draws a 45-degree reference line on the fitted-vs-true plot.

Also, note the docstring description.
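Here is a sketch of plot_fitted, assuming fit stored the training data as self.X_ and self.y_ as in the earlier sketch:

```python
import matplotlib.pyplot as plt

# This method goes inside the MyLinearRegression class
def plot_fitted(self, reference_line=False):
    """
    Plot fitted values against the true output values.

    Arguments:
    reference_line: draw a 45-degree reference line on the plot (default False)
    """
    y_fitted = self.predict(self.X_)
    plt.scatter(self.y_, y_fitted, s=10)
    if reference_line:
        # Perfect predictions would fall exactly on this line
        lims = [self.y_.min(), self.y_.max()]
        plt.plot(lims, lims, 'k--')
    plt.xlabel("True values")
    plt.ylabel("Fitted values")
    plt.title("Fitted vs. true values")
    plt.show()
```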

We can test the method plot_fitted by simply doing the following:

```python
m = MyLinearRegression()
m.fit(X, y)
m.plot_fitted()
```

Or, we can opt to draw the reference line:

```python
m.plot_fitted(reference_line=True)
```

We get the fitted-vs-true plots, with and without the 45-degree reference line.

Once we understand that we can add any useful method that works on the same data (a training set) and relates to the same purpose (linear regression), there is no bound to our imagination! How about we add the following plots to our class?

- Pairplots (pairwise relations between all features and the output, much like the pairs function in R)
- Fitted vs. residual plot (one of the diagnostic plots for linear regression, i.e. a check on the validity of the fundamental assumptions)
- Histogram and quantile-quantile (Q-Q) plot of the residuals (a check on the assumption of Normality of the error distribution)

Inheritance — don’t overburden your main class

As we enthusiastically plan utility methods to add to the class, we recognize that this approach may make the code of the main class very long and difficult to debug.

To solve the conundrum, we can make use of another beautiful principle of OOP — inheritance.

Inheritance in Python – GeeksforGeeks: Inheritance is the capability of one class to derive or inherit the properties from another class. (www.geeksforgeeks.org)

We further recognize that not all plots are of the same type.

Pairplots and fitted-vs-true data plots are of a similar nature, as they can be derived from the data alone.

Other plots are related to the goodness-of-fit and residuals.

Therefore, we can create two separate classes with those plotting functions — Data_plots and Diagnostics_plots.

And guess what? We can define our main MyLinearRegression class in terms of these utility classes.

That is inheritance.
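Skeletons of the two classes might look like the following. This is a sketch: only a few representative methods are shown, and they assume the model stores self.X_ and self.y_ and has a predict method, as in the sketches above. The versions in the repo are richer.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats


class Data_plots:
    """Plots that can be derived from the data alone."""

    def pairplot(self):
        # Pairwise relations between all features and the output,
        # much like the pairs() function in R
        cols = [f"X{i + 1}" for i in range(self.X_.shape[1])] + ["y"]
        df = pd.DataFrame(np.c_[self.X_, self.y_], columns=cols)
        sns.pairplot(df)
        plt.show()


class Diagnostics_plots:
    """Plots related to goodness-of-fit and residuals."""

    def fitted_vs_residual(self):
        # Residuals vs. fitted values: no obvious pattern should appear
        y_fitted = self.predict(self.X_)
        plt.scatter(y_fitted, self.y_ - y_fitted, s=10)
        plt.axhline(y=0, color='k', linestyle='--')
        plt.xlabel("Fitted values")
        plt.ylabel("Residuals")
        plt.show()

    def histogram_resid(self):
        # Histogram of residuals: should look roughly Normal
        resid = self.y_ - self.predict(self.X_)
        plt.hist(resid, bins=20, edgecolor='k')
        plt.xlabel("Residuals")
        plt.title("Histogram of residuals")
        plt.show()

    def qqplot_resid(self):
        # Q-Q plot of residuals against a Normal distribution
        resid = self.y_ - self.predict(self.X_)
        stats.probplot(resid, dist="norm", plot=plt)
        plt.show()
```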

And the definition of MyLinearRegression changes only slightly:

```python
class MyLinearRegression(Diagnostics_plots, Data_plots):

    def __init__(self, fit_intercept=True):
        self.coef_ = None
        self.intercept_ = None
        self._fit_intercept = fit_intercept
```

By simply passing Data_plots and Diagnostics_plots into the class definition of MyLinearRegression, we inherit all the methods and properties of those classes.

Now, to check the Normality assumption of the error terms, we can simply fit the model and run those methods.

```python
m = MyLinearRegression()  # A brand new model instance
m.fit(X, y)               # Fit the model with some data

m.histogram_resid()       # Plot histogram of the residuals
m.qqplot_resid()          # Q-Q plot of the residuals
```

We get the residual histogram and the Q-Q plot. Again, the separation of code is at work here.

You can modify and improve the core plotting utilities without touching the main class.

A highly flexible and less error-prone approach!

Do more with the power of OOP

We will not elaborate further on the various utility classes and methods we can add to MyLinearRegression.

You can check the GitHub repo.

Additional classes added

Just for completeness, we added:

- A class Metrics for computing various regression metrics — SSE, SST, MSE, R², and adjusted R² (see the sketch below)
- A class Outliers to plot Cook’s distance, leverage, and influence plots
- A class Multicollinearity to compute variance inflation factors (VIF)

All in all, the grand scheme has the main class inheriting from all of these utility classes. Is this class richer than Scikit-learn’s LinearRegression class? You decide.
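For illustration, a minimal Metrics mixin could look like this — a sketch; the repo's version computes more and handles edge cases:

```python
import numpy as np


class Metrics:
    """Regression metrics, computed from the stored training data."""

    def sse(self):
        # Sum of squared errors (residual sum of squares)
        return np.sum((self.y_ - self.predict(self.X_)) ** 2)

    def sst(self):
        # Total sum of squares around the mean of y
        return np.sum((self.y_ - np.mean(self.y_)) ** 2)

    def mse(self):
        # Mean squared error
        return self.sse() / len(self.y_)

    def r_squared(self):
        # Coefficient of determination: R^2 = 1 - SSE/SST
        return 1.0 - self.sse() / self.sst()

    def adj_r_squared(self):
        # Adjusted R^2 penalizes for the number of predictors p
        n, p = self.X_.shape
        return 1.0 - (1.0 - self.r_squared()) * (n - 1) / (n - p - 1)
```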

Add syntactic sugar by creating grouped utilities

Once you have inherited other classes, their methods behave just like those of any Python module you are familiar with.

So, you can add utility methods to the main class to execute multiple methods from a sub-class together.

For example, the following method runs all the usual diagnostics checks at once.

Note how we are accessing the plot methods with a simple DOT, i.e. Diagnostics_plots.histogram_resid. Just like accessing a function from the Pandas or NumPy library!

With a run_diagnostics method in the main class, we can run all the diagnostics with a single line of code after fitting the data.
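A minimal sketch of such a method, reusing the names from the Diagnostics_plots sketch above (the repo's version runs more checks):

```python
# This method goes inside the MyLinearRegression class
def run_diagnostics(self):
    """Run all the residual diagnostics plots in one call."""
    # Note the explicit dot access into the sub-class
    Diagnostics_plots.fitted_vs_residual(self)
    Diagnostics_plots.histogram_resid(self)
    Diagnostics_plots.qqplot_resid(self)
```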

```python
m = MyLinearRegression()  # A brand new model instance
m.fit(X, y)               # Fit the model with some data
m.run_diagnostics()       # All the diagnostics plots in one line
```

Similarly, you can add all the outlier plots in a single utility method.

Modularization — import the class as a module

Although not a canonical OOP principle, an essential advantage of following the OOP paradigm is the ability to modularize your code.

You can experiment and develop all this code in a standard Jupyter notebook.

But for maximum modularity, consider converting the Notebook into a standalone executable Python script (with a .py extension).

As a good practice, remove all the unnecessary comments and test code from this file and keep only the classes together.

Here is the link to the script I put together for this article.

Once you do that, you can import the MyLinearRegression class from a completely different Notebook.

This is often the preferred way of testing your code, as it does not touch the core model code but only exercises it with various data samples and function parameters.
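For instance, assuming you saved all the classes in a script named MLR.py (a hypothetical name) next to your notebook, the test notebook only needs:

```python
# MLR.py is a hypothetical file name; adjust to your script's name
from MLR import MyLinearRegression

m = MyLinearRegression()    # The imported class works exactly as before
m.fit(X, y)                 # X, y: whatever test data you define here
print(m.coef_, m.intercept_)
m.run_diagnostics()
```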

At this point, you can consider putting this Python script on GitHub, creating a setup.py file, building the proper directory structure, and releasing it as a standalone linear regression package which does fitting, prediction, plotting, diagnostics, and more.

Of course, you have to add plenty of docstrings, usage examples, assertion checks, and unit tests to make it a good package.

But as a data scientist, now you have added a significant skill to your repertoire – software development following OOP principles.

It was not so difficult, was it?

Epilogue

To write this post, I was inspired by this fantastic article, which drills down into the concept of OOP in Python in more detail, in the context of machine learning.

I wrote a similar article, touching even more basic approaches, in the context of deep learning.

Check it out here: How a simple mix of object-oriented programming can sharpen your deep learning prototype – By mixing simple concepts of object-oriented programming, like functionalization and class inheritance, you can add… (towardsdatascience.com)

If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com.

Also, you can check the author’s GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine learning resources.

If you are, like me, passionate about machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

Tirthajyoti Sarkar – Sr. Principal Engineer – Semiconductor, AI, Machine Learning – ON… (www.linkedin.com)
