Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line.

This could be useful for random sampling.

Given just a file name, shuf randomly permutes the lines of the file.

With the option -n you can specify how many lines to return.

So it’s doing sampling without replacement.

For example, shuf -n 10 foo.txt would select 10 lines from foo.txt.
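Here is a minimal sketch of both behaviors, assuming GNU shuf and a throwaway 20-line file built with seq. The temporary file is hypothetical and only stands in for foo.txt.

```shell
# Hypothetical sample file standing in for foo.txt: 20 distinct lines.
file=$(mktemp)
seq 1 20 > "$file"

# File name only: every line appears exactly once, in random order.
permuted=$(shuf "$file")

# -n 10: ten lines, none repeated (sampling without replacement).
sample=$(shuf -n 10 "$file")

printf '%s\n' "$sample" | wc -l            # number of lines drawn
printf '%s\n' "$sample" | sort | uniq -d   # prints nothing: no duplicates

rm -f "$file"
```

Because the input lines are all distinct, the uniq -d check is a quick way to confirm that no line was drawn twice.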

Actually, it would select at most 10 lines.

You can’t select 10 lines without replacement from a file with fewer than 10 lines.

If you ask for more lines than the file contains, the -n option is effectively ignored and you get every line of the file in random order.
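A small sketch of that edge case, assuming GNU shuf. The three-line file is made up for illustration.

```shell
# Hypothetical three-line file.
small=$(mktemp)
printf '%s\n' alpha beta gamma > "$small"

# Ask for 10 lines without replacement from a 3-line file.
capped=$(shuf -n 10 "$small")

# Only three lines come back: the whole file, shuffled.
printf '%s\n' "$capped" | wc -l

rm -f "$small"
```

No error is reported in this case, so a script that needs exactly 10 lines should check the count itself.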

You can also sample with replacement using the -r option.

In that case you can select more lines than are in the file since lines may be reused.

For example, you could run shuf -r -n 10 foo.txt to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has.
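As a rough sketch, again assuming GNU shuf and a made-up three-line file, you can tally how often each line comes up when drawing with replacement:

```shell
# Hypothetical three-line file.
three=$(mktemp)
printf '%s\n' alpha beta gamma > "$three"

# -r allows repeats, so 10 draws from 3 lines is fine.
draws=$(shuf -r -n 10 "$three")

# Tally how many times each line was drawn; counts vary from run to run.
printf '%s\n' "$draws" | sort | uniq -c

rm -f "$three"
```

The counts always add up to 10, but a given line may show up many times or not at all.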

For example, when I ran shuf -r -n 10 on a file containing the lines alpha, beta, and gamma, I got the output beta, gamma, gamma, beta, alpha, alpha, gamma, gamma, beta.

I don’t know how shuf seeds its random generator.

Maybe from the system time.

But if you run it twice you will get different results.

Probably.
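If you want the same sample every time, GNU shuf has a --random-source=FILE option that takes its random bytes from a file you supply. Here is a minimal sketch, assuming GNU shuf. The file of zero bytes is a deliberately poor source of randomness, used only to show repeatability, and both temporary files are hypothetical.

```shell
# Hypothetical 20-line data file.
data=$(mktemp)
seq 1 20 > "$data"

# Fixed byte source: 4096 zero bytes, written once and reused.
# Assumption: this is far more bytes than shuf needs for 20 lines.
bytes=$(mktemp)
head -c 4096 /dev/zero > "$bytes"

# Same input and same byte source, so both runs make the same choices.
first=$(shuf -n 5 --random-source="$bytes" "$data")
second=$(shuf -n 5 --random-source="$bytes" "$data")

[ "$first" = "$second" ] && echo "same sample both times"

rm -f "$data" "$bytes"
```

For a sample that is both reproducible and well mixed, you would want a byte source that is fixed but not trivially patterned.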
