Synthetic data generation — a must-have skill for new data scientists

Although scikit-learn's ML algorithms are widely used, what is less appreciated is its offering of synthetic data generation functions.

scikit-learn: machine learning in Python (scikit-learn.org)

Here is a quick rundown.

Regression problem generation: Scikit-learn's datasets.make_regression function can create a random regression problem with an arbitrary number of input features and output targets, and a controllable degree of informative coupling between them. It can also mix in Gaussian noise.

Fig: Random regression problem generation using scikit-learn with varying degree of noise.

Classification problem generation: Similar to the regression function above, datasets.make_classification generates a random multi-class classification problem (dataset) with controllable class separation and added noise. You can also randomly flip any percentage of output labels to create a harder classification dataset.

Fig: Random classification problem generation using scikit-learn with varying class separation.

Clustering problem generation: There are quite a few functions for generating interesting clusters. The most straightforward one is datasets.make_blobs, which generates an arbitrary number of clusters with controllable distance parameters.

Fig: Simple cluster data generation using scikit-learn.

Anisotropic cluster generation: With a simple transformation using matrix multiplication, you can generate clusters that are aligned along a certain axis or anisotropically distributed.

Fig: Anisotropically aligned cluster data generation using scikit-learn.

Concentric ring cluster data generation: For testing affinity-based clustering algorithms or Gaussian mixture models, it is useful to have clusters generated in a special shape. We can use the datasets.make_circles function to accomplish that. And, of course, we can mix a little noise into the data to test the robustness of the clustering algorithm.

Moon-shaped cluster data generation: We can also generate moon-shaped cluster data for testing algorithms, with controllable noise, using the datasets.make_moons function.

Data generation with arbitrary symbolic expressions

While the aforementioned functions are great to start with, the user has no easy control over the underlying mechanics of the data generation, and the regression outputs do not follow a function that the user specifies; the underlying relationship is itself randomly generated. While this may be sufficient for many problems, one may often require a controllable way to generate these problems based on a well-defined function (involving linear, nonlinear, rational, or even transcendental terms).

For example, we may want to evaluate the efficacy of various kernelized SVM classifiers on datasets with increasingly complex separators (linear to non-linear), or to demonstrate the limitation of linear models on regression datasets generated by rational or transcendental functions. It will be difficult to do so with these scikit-learn functions.

Moreover, a user may want to just input a symbolic expression as the generating function (or as the logical separator for a classification task). There is no easy way to do so using only scikit-learn's utilities, and one has to write their own function for each new instance of the experiment.

To solve the problem of symbolic expression input, one can take advantage of the Python package SymPy, which allows parsing, rendering, and evaluation of symbolic mathematical expressions up to a fairly high level of sophistication.

In one of my previous articles, I laid out in detail how one can build upon the SymPy library to create functions that are similar to those available in scikit-learn but can generate regression and classification datasets from symbolic expressions of a high degree of complexity. Check out that article (linked below) and my GitHub repository for the actual code.

Random regression and classification problem generation with symbolic expression: We describe how using SymPy, we can set up random sample generators for polynomial (and nonlinear) regression and… (towardsdatascience.com)

For example, we can take a symbolic expression that is the product of a square term (x²) and a sinusoidal term like sin(x), and create a randomized regression dataset out of it.

Fig: Randomized regression dataset with symbolic expression: x²·sin(x)

Or, one can generate a dataset with a non-linear elliptical classification boundary for testing a neural network algorithm. Note, in the figure below, how the user can input a symbolic expression m='x1**2-x2**2' and generate this dataset.

Fig: Classification samples with non-linear separator.

Categorical data generation using “pydbgen” library

While many high-quality real-life datasets are available on the web for trying out cool machine learning techniques, in my personal experience the same is not true when it comes to learning SQL.

For data science expertise, a basic familiarity with SQL is almost as important as knowing how to write code in Python or R. But access to a large enough database with real categorical data (such as name, age, credit card, SSN, address, birthday, etc.) is not nearly as common as access to toy datasets on Kaggle that are specifically designed or curated for machine learning tasks.

Apart from beginners in data science, even seasoned software testers may find it useful to have a simple tool with which, in a few lines of code, they can generate arbitrarily large data sets with random (fake) yet meaningful entries.

Enter pydbgen, which comes with its own documentation. It is a lightweight, pure-Python library to generate random useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number, etc.) and save them in either a pandas DataFrame object, a SQLite table in a database file, or an MS Excel file.

Introducing pydbgen: A random dataframe/database table generator: A lightweight Python package for generating random database/dataframe to use in data science, learning SQL, machine… (towardsdatascience.com)

You can read the article above for more details.
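
Returning to the scikit-learn generators from the quick rundown, here is a minimal sketch of how each of them might be called. The sample counts, noise levels, seed, and the 2×2 transformation matrix are arbitrary values chosen for illustration; they are not taken from the figures in this article.

```python
import numpy as np
from sklearn.datasets import (
    make_blobs,
    make_circles,
    make_classification,
    make_moons,
    make_regression,
)

SEED = 42  # arbitrary seed, used only to make the sketch reproducible

# Regression: 3 input features, 2 of them informative, Gaussian noise on the target
X_reg, y_reg = make_regression(
    n_samples=200, n_features=3, n_informative=2, noise=10.0, random_state=SEED
)

# Classification: 3 classes; class_sep controls how far apart the classes are,
# flip_y is the fraction of samples whose label is reassigned at random
X_clf, y_clf = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=1.5, flip_y=0.05,
    random_state=SEED,
)

# Clustering: 3 isotropic Gaussian blobs in 2 dimensions
X_blob, y_blob = make_blobs(
    n_samples=200, centers=3, cluster_std=1.0, random_state=SEED
)

# Anisotropic clusters: stretch and rotate the blobs with a matrix multiplication
# (the matrix below is an arbitrary illustrative choice)
transform = np.array([[0.6, -0.6], [-0.4, 0.8]])
X_aniso = X_blob @ transform

# Concentric rings and interleaved half-moons, both with a little noise
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=SEED)
X_moon, y_moon = make_moons(n_samples=200, noise=0.1, random_state=SEED)
```

Each call returns a feature array and a target (or label) array, so the results can be passed straight to a plotting routine or to a model's fit method.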

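The symbolic-expression idea described earlier can be sketched in a few lines with SymPy and NumPy. This is not the code from the article's repository: the helper names gen_regression_symbolic and gen_classification_symbolic, their parameters, the uniform sampling range, and the "expression greater than zero" labeling rule are all assumptions made for illustration.

```python
import numpy as np
import sympy as sp


def gen_regression_symbolic(expr_str, n_samples=100, x_range=(-5.0, 5.0),
                            noise_sd=1.0, seed=0):
    """Hypothetical helper: sample x uniformly, evaluate the expression, add Gaussian noise."""
    x = sp.Symbol("x")
    expr = sp.sympify(expr_str)               # parse the string into a SymPy expression
    f = sp.lambdify(x, expr, modules="numpy")  # compile it into a NumPy-aware function
    rng = np.random.default_rng(seed)
    X = rng.uniform(x_range[0], x_range[1], size=n_samples)
    y = f(X) + rng.normal(0.0, noise_sd, size=n_samples)
    return X, y


def gen_classification_symbolic(expr_str, n_samples=100, x_range=(-5.0, 5.0),
                                flip_frac=0.0, seed=0):
    """Hypothetical helper: label a point 1 where the expression in x1, x2 is positive, else 0."""
    x1, x2 = sp.symbols("x1 x2")
    expr = sp.sympify(expr_str)
    f = sp.lambdify((x1, x2), expr, modules="numpy")
    rng = np.random.default_rng(seed)
    X = rng.uniform(x_range[0], x_range[1], size=(n_samples, 2))
    labels = (f(X[:, 0], X[:, 1]) > 0).astype(int)
    # Optionally flip a random fraction of labels to make the problem harder
    flip = rng.random(n_samples) < flip_frac
    labels[flip] = 1 - labels[flip]
    return X, labels


# The two expressions mentioned in the article
X_r, y_r = gen_regression_symbolic("x**2*sin(x)", n_samples=50, noise_sd=0.5)
X_c, y_c = gen_classification_symbolic("x1**2 - x2**2", n_samples=200, flip_frac=0.05)
```

Using lambdify rather than substituting values one at a time keeps the evaluation vectorized, which matters once the sample count grows.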