Munging CSV files with standard Unix tools

I’ll point out some of their quirks, which are arguments for using something else.

But the assumption here is that you don’t want to use something else.

Maybe you already know the standard utilities and don’t think that learning more specialized tools is worth the effort.

Maybe you’re already at the command line and in a command line state of mind, and don’t want to interrupt your work flow by doing something else.

Maybe you’re on a computer where you don’t have the ability to install any software and so you need to work with what’s there.

Whatever your reasons, we’ll go with the assumption that we’re committed to using commands that have been around for decades.

cut, sort, and awkThe tools I want to look at are cut, sort, and awk.

I wrote about cut the other day and apparently the post struck a chord with some readers.

This post is a follow-up to that one.

These three utilities are standard on Unix-like systems.

You can also download them for Windows from GOW.

The port of sort will be named gsort in order to not conflict with the native Windows sort function.

There’s no need to rename the other two utilities since they don’t have counterparts that ship with Windows.

The sort command is simple and useful.

There are just a few options you’ll need to know about.

The utility sorts fields as text by default, but the -n tells it to sort numerically.

Since we’re talking about CSV files, you’ll need to know that -t, is the option to tell sort that fields are separated by commas rather than white space.

And to specify which field to sort on, you give it the -k option.

The last utility, awk, is more than a utility.

It’s a small programming language.

But it works so well from the command line that you can almost think of it as a command line utility.

It’s very common to pipe output to an awk program that’s only a few characters long.

You can get started quickly with awk by reading Greg Grothous’ article Why you should learn just a little awk.

InconsistenciesNow for the bad news: these programs are inconsistent in their options.

The two most common things you’ll need to do when working with CSV files is to set your field delimiter to a comma and specify what field you want to grab.

Unfortunately this is done differently in every utility.

cut uses -d or –delimiter to specify the field delimiter and -f or –fields to specify fields.

Makes sense.

sort uses -t or –field-separator to specify the field delimiter and -k or –key to specify the field.

When you’re talking about sorting things, it’s common to call the fields keys, and so the way sort specifies fields makes sense in context.

I see no reason for -t other than -f was already taken.

(In sorting, you talk about folding upper case to lower case, so -f stands for fold.

)awk uses -F or –field-separator to specify the field delimiter.

At least the verbose option is consistent with sort.

Why -F for the short option instead of -f?.The latter was already taken for file.

To tell awk to read a program from a file rather than the command line you use the -f option.

awk handles fields differently than cut and sort.

Because it is a programming language designed to parse delimited text files, each field has a built-in variable: $1 holds the content of the first field, $2 the second, etc.

The following compact table summarizes how you tell each utility that you’re working with comma-separated files and that you’re interested in the second field.

|——+—–+—–| | cut | -d, | -f2 | | sort | -t, | -k2 | | awk | -F, | $2 | |——+—–+—–| Trade-offsSome will object that the inconsistencies documented above are a good example of why you shouldn’t work with CSV files using cut, sort, and awk.

You could use other command line utilities designed for working with CSV files.

Or pull your CSV file into R or Pandas.

Or import it somewhere to work with it in SQL.

Etc.

The alternatives are all appropriate for different uses.

The premise here is that in some circumstances, the inconsistencies cataloged above are a regrettable but acceptable price to pay to stay at the command line.

RelatedWorking with wide text files at the command lineA little awkLearn one sed commandUnix tooltip[1] Things get complicated if you have a CSV file and fields contain commas inside strings.

Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs.

The utilities mentioned here all support tab as a delimiter by default.

.. More details

Leave a Reply