How to produce meaningful datasets using only SQL

Now we have something we can use in multiple scenarios.

Let the real fun begin!Random boolean valuesWith our new randomNumber function, Booleans now become very easy as well since PostgreSQL recognizes 0as false and 1 as true.

Select Random BooleanLet’s turn this one into a function as well, which will come in handy very shortly:Random Boolean FunctionNow let’s give this a trial run to make sure it’s working like we expect:Select Random BooleansHalf of the random boolean values are TRUE and half are FALSE — perfect!Weighted random booleansMany times you need to simulate authentic variation in your seed data, and weighting is one way to accomplish this.

For example, I may want to generate boolean values for a population of data where the majority of values should be false.

We’re going to accomplish this with yet another function so that we can easily reference it in SQL:Random Weighted Boolean FunctionNow, let’s return to my previous hypothetical scenario.

If I want the false values to be the majority, I can simply pass in a lower trueWeight.

Since our base randomNumber function returns values between 0and 1, I’ll need to represent trueWeight as a decimal “percentage”.

The other part of testing this out is to generate enough values to determine if my weighting logic is working sufficiently over a larger population.

For this we’ll use the extremely handy PostgreSQL function generate_series.

Select Random Weighted BooleanSo, across 100 iterations, our weighted boolean function generated approximately 75% false values!Random row(s) from a tableAnother interesting facet of the random function is that is can be used in the ORDER BY clause.

Let’s take a look at an example by generating a sample data set (again, using generate_series) with and without a randomized ordering.

Randomized OrderingPretty cool!.Now we can easily grab a random, single row from any given data set using the PosrgresSQL LIMIT clause:Random Row SelectionRandom value from an enumerated listThis is one of the most common scenarios I have run across.

We need randomized data, but the values should be restricted to an enumerated list that we know in advance.

For example, say we wanted to return a random value from the list [Cyan, Magenta, Yellow, Black].

The first challenge will be to turn our list into a data set.

This is easily done by leveraging PostgreSQL ARRAYs.

Let’s give it a try:Select List to ArrayOur list of values has been converted to an ARRAY (notice the curly braces {} around the list).

But how do we return the values as individual rows?.PostgreSQL includes a nifty array function named unnest that does this for us:Unnest ArrayBuilding on this concept as well as several of our previous examples, we can create another function that enables us to use this approach for any enumerated list.

We’ll make special use of the [] notation for defining a TEXT array, since treating all input values as TEXT will give us the greatest flexibility.

Random Value From List FunctionNow let’s test out our new function:Select Random Values From ListsWow!.I hope you’re having as much fun as I am!.????.????Random textThis is an interesting one.

We can easily generate a random string.

Notice I did not say “meaningful”.

Here’s one way:Random StringWe can also shorten it to the desired length (note: using the md5 function will only generate a 32-byte string):Random String of Specified LengthWe can also manipulate the text case:Random String of Specified Length and CaseThis will only provide limited value, especially if you need longer strings that are more meaningful (e.

g.

person names, addresses, etc).

To address the length concern, let’s start with an example I originally found posted by user Lyndon S on StackOverflow.

Here is my adaptation, which will produce a random list of 10 characters from the English alphabet:Random English CharactersBuilding on that, we can use our old friend generate_series along with a new friend string_agg to generate random “words” of a specified length.

Note that I have artificially limited the maximum string length to 100 characters:Select Random WordsSo we have a working solution for random “words”, but this will not suffice for use cases that require simulated “real” data.

To randomly generate truly meaningful strings requires having a list of meaningful values stored in a table.

I have done this before by pre-populating a “dummy names” table from which I can, for example, randomly select a random first name and last name for a person.

Again, this is also where third party libraries like faker.

js come in handy.

Miscellaneous use casesUsing a combination of some of the things we have learned so far (and a few new concepts), here are several interesting use cases I have come across that required randomized seed data:Body TemperaturePhone NumberFuture date in the next 12 monthsRandom date within the past 30 daysRandom date of birth for a person under the age of 18I hope this gives you an idea just how easy it is to accomplish some pretty powerful things using basic SQL.

All the SQL Tips in this article (plus more in the future) can be found here.

I’d love to hear your favorite tips and clever SQL solutions, so please share them in the comments below!.

. More details

Leave a Reply