Random regression and classification problem generation with symbolic expression

We describe how, using SymPy, we can set up random sample generators for polynomial (and nonlinear) regression and classification problems.

scikit-learn's dataset generation utility functions include:

sklearn.datasets.make_regression: The output is generated by applying a (potentially biased) random linear regression model with a definite number of nonzero regressors to the previously generated input and some centered Gaussian noise with an adjustable scale.
sklearn.datasets.make_classification: Generate a random n-class classification problem.

Here are a few basic examples:

[Figure: basic examples]

Random regression and classification dataset generation using symbolic expression supplied by user

The details of the code can be found in my GitHub repo, but the idea is simple. If no symbolic expression is supplied, however, a default simple polynomial can be invoked to generate classification samples with n_features features.

flip_y: Probability of flipping the classification labels randomly.
noise: Magnitude of noise (default Gaussian) to be introduced (added to the output).
noise_dist: Type of the probability distribution of the noise signal. Currently supports: Normal, Uniform, Beta, Gamma, Poisson, Laplace.
Return: Returns a NumPy ndarray with dimension (n_samples, n_features+1). The last column is the response vector.

Examples

Here are a few code snippets and the resulting data sets, visualized.

[Figure: Classification Samples]
[Figure: Regression Samples]

Not limited to single symbolic variable

Although the above examples are shown using one or two variables, the functions are not limited by the number of variables. In fact, the internal methods are coded to automatically infer the number of independent variables from your symbolic expression input and set up the problem accordingly.
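The idea of inferring the variables from the expression can be sketched in a few lines of SymPy and NumPy. This is only a minimal illustration under stated assumptions, not the code from the GitHub repo: the function names gen_regression_symbolic and gen_classification_symbolic, the uniform sampling range, the Gaussian-only noise, and the rule that a sample gets label 1 when the expression is positive are all hypothetical choices made for this sketch.

```python
import numpy as np
import sympy


def _compile(expr):
    """Parse an expression string and return (numpy callable, symbols)."""
    parsed = sympy.sympify(expr)
    # Infer the independent variables from the expression itself.
    # Sorting is alphabetical by name, so "x10" would sort before "x2".
    symbols = sorted(parsed.free_symbols, key=lambda s: s.name)
    func = sympy.lambdify(symbols, parsed, modules="numpy")
    return func, symbols


def _evaluate(func, X):
    """Evaluate the compiled expression on every row of X."""
    values = np.asarray(func(*X.T), dtype=float)
    return np.broadcast_to(values, (X.shape[0],)).copy()


def gen_regression_symbolic(expr, n_samples=100, noise=0.0,
                            low=-5.0, high=5.0, seed=None):
    """Hypothetical sketch: regression samples from a symbolic expression.

    Returns an array of shape (n_samples, n_features + 1); the last
    column is the response.
    """
    rng = np.random.default_rng(seed)
    func, symbols = _compile(expr)
    X = rng.uniform(low, high, size=(n_samples, len(symbols)))
    # Assumption: additive, zero-mean Gaussian noise scaled by `noise`.
    y = _evaluate(func, X) + rng.normal(0.0, noise, size=n_samples)
    return np.column_stack([X, y])


def gen_classification_symbolic(expr, n_samples=100, flip_y=0.0,
                                low=-5.0, high=5.0, seed=None):
    """Hypothetical sketch: binary labels from the sign of the expression."""
    rng = np.random.default_rng(seed)
    func, symbols = _compile(expr)
    X = rng.uniform(low, high, size=(n_samples, len(symbols)))
    # Assumption: class 1 where the expression is positive, else class 0.
    labels = (_evaluate(func, X) > 0).astype(int)
    # Flip each label independently with probability flip_y.
    flip = rng.random(n_samples) < flip_y
    labels[flip] = 1 - labels[flip]
    return np.column_stack([X, labels])


# n_features is never passed; three symbols are found in the expression.
data = gen_regression_symbolic("x1**2 + 3*x2 - sin(x3)", n_samples=5,
                               noise=0.1, seed=42)
print(data.shape)  # (5, 4)
```

Using free_symbols keeps the interface small: the expression alone determines the width of the output array, so the caller cannot pass a feature count that disagrees with it.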
Here is an example where n_features is not even given by the user, but the function infers the number of features to be 3 from the symbolic expression.

[Figure: dataset generated from a three-variable symbolic expression]

Summary and future expansions

The basic code is set up to mimic scikit-learn’s dataset generation utility functions as closely as possible. One can easily extend it to provide a Pandas DataFrame output, or a CSV file output for use in any other programming environment and for saving the data to the local disk. Up to a certain degree of complexity, it is also possible to provide the user with a string representation of the LaTeX formula for the symbolic expression. Readers are certainly encouraged to send their comments or note them in the GitHub repo.

If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com. Also, you can check the author’s GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine learning resources. If you are, like me, passionate about machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

If you liked this article, please don’t forget to leave a clap :-)

Keywords: #machinelearning, #symbolicmath, #supportvectormachine, #randomization, #regression, #classification.
