Shell Basics every Data Scientist should know

It gives us a way to run our basic commands one after another.

Many commands are relatively basic on their own, and chaining them in sequence lets us do some fairly non-trivial things.

Now let me tell you about a couple more commands before I show you how we can chain them to do reasonably advanced tasks.

4. wc: wc is a fairly useful shell utility that lets us count the number of lines (-l), words (-w), or bytes (-c) in a given file. For plain ASCII text the byte count equals the character count; otherwise use -m to count characters.
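Here is a minimal sketch of the three counts. The file name notes.txt and its contents are made up for illustration, and the `tr -d ' '` step is only there because some wc implementations pad their output with spaces.

```shell
# Hypothetical three-line file, created only for this illustration
workdir=$(mktemp -d)
printf 'first line\nsecond line here\nthird\n' > "$workdir/notes.txt"

# Reading from stdin keeps the file name out of wc's output
line_count=$(wc -l < "$workdir/notes.txt" | tr -d ' ')
word_count=$(wc -w < "$workdir/notes.txt" | tr -d ' ')
byte_count=$(wc -c < "$workdir/notes.txt" | tr -d ' ')

echo "lines=$line_count words=$word_count bytes=$byte_count"
rm -rf "$workdir"
```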

5. grep: You may want to print all the lines in your file which have a particular word.

Or you might like to see the salaries for the team BAL in 2000.
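A command along these lines would do it. This is only a sketch: the file name salaries.txt, the pipe-delimited layout (year|team|league|player|salary), and the rows themselves are assumptions made up for the example. I use -F so that grep treats the pattern as a plain string and matches the "|" character literally.

```shell
# Hypothetical pipe-delimited sample: year|team|league|player|salary
workdir=$(mktemp -d)
cat > "$workdir/salaries.txt" <<'EOF'
1999|BAL|AL|player01|500000
2000|BAL|AL|player01|750000
2000|BAL|AL|player02|1200000
2000|NYA|AL|player03|900000
2001|ATL|NL|player04|300000
EOF

# -F: fixed-string match, so "|" is not interpreted as a regex operator
matches=$(grep -F '2000|BAL' "$workdir/salaries.txt")
echo "$matches"

match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
rm -rf "$workdir"
```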

In this case, we have printed all the lines in the file which contain “2000|BAL”.

grep is your friend.

You could also use regular expressions with grep.
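As a rough sketch of the regular expression case, the following counts the rows whose year is either 1999 or 2001. The file and its rows are again made up for illustration. -E turns on extended regular expressions, and "[|]" matches a literal pipe character.

```shell
# Hypothetical pipe-delimited sample: year|team|league|player|salary
workdir=$(mktemp -d)
cat > "$workdir/salaries.txt" <<'EOF'
1999|BAL|AL|player01|500000
2000|BAL|AL|player01|750000
2000|BAL|AL|player02|1200000
2000|NYA|AL|player03|900000
2001|ATL|NL|player04|300000
EOF

# ^(1999|2001) anchors the alternation to the start of the line;
# [|] then requires the field delimiter right after the year
regex_matches=$(grep -E '^(1999|2001)[|]' "$workdir/salaries.txt")
echo "$regex_matches"

# -c prints the number of matching lines instead of the lines themselves
regex_count=$(grep -Ec '^(1999|2001)[|]' "$workdir/salaries.txt")
rm -rf "$workdir"
```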

6. sort: You may want to sort your dataset on a particular column.

sort is your friend.

Say you want to find the ten highest salaries given to any player in your dataset.
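One way to do that is to sort numerically on the salary column in descending order and keep the first ten lines with head. This is a sketch under assumptions: the file name, the pipe delimiter, the salary sitting in column 5, and the rows are all made up for the example.

```shell
# Hypothetical pipe-delimited sample: year|team|league|player|salary
workdir=$(mktemp -d)
cat > "$workdir/salaries.txt" <<'EOF'
1999|BAL|AL|player01|500000
2000|BAL|AL|player01|750000
2000|BAL|AL|player02|1200000
2000|NYA|AL|player03|900000
2001|ATL|NL|player04|300000
EOF

# -t '|'  : fields are separated by "|"
# -k 5,5  : sort on column 5 only
# -n -r   : compare as numbers, largest first
top_salaries=$(sort -t '|' -k 5,5 -n -r "$workdir/salaries.txt" | head -n 10)
echo "$top_salaries"

highest=$(printf '%s\n' "$top_salaries" | head -n 1)
rm -rf "$workdir"
```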

So there are indeed a lot of options in this command.

Let’s go through them one by one.

-t: Which delimiter to use?

-k: Which column to sort on?

-n: If you want numerical sorting. Don’t use this option if you wish to do lexicographical sorting.

-r: Sort in descending order. sort is ascending by default.

7. cut: This command lets you select specific columns from your data.

Sometimes you may want to look at just some of the columns in your data.

For example, you may want to look only at the year, team, and salary, and not the other columns.

cut is the command to use.
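A small sketch of that selection, assuming a pipe-delimited file where the year, team, and salary are columns 1, 2, and 5. The file name and rows are made up for illustration.

```shell
# Hypothetical pipe-delimited sample: year|team|league|player|salary
workdir=$(mktemp -d)
cat > "$workdir/salaries.txt" <<'EOF'
1999|BAL|AL|player01|500000
2000|BAL|AL|player01|750000
2000|NYA|AL|player03|900000
EOF

# Keep only columns 1, 2 and 5; the output keeps the same delimiter
selected=$(cut -d '|' -f 1,2,5 "$workdir/salaries.txt")
echo "$selected"

first_row=$(printf '%s\n' "$selected" | head -n 1)
rm -rf "$workdir"
```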

The options are:

-d: Which delimiter to use?

-f: Which column or columns to cut?

8. uniq: uniq is a little bit tricky because you will usually want to use it in sequence with sort.

This command removes adjacent duplicate lines.

So in conjunction with sort, it can be used to get the distinct values in the data.

For example, I could use it together with cut, sort, and head to find ten distinct teams in the data.

This command can also be used with the -c argument to count the occurrences of these distinct values.

Something akin to count distinct.
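Both uses can be sketched as follows. The file name, the pipe delimiter, the team sitting in column 2, and the rows are assumptions made up for the example. The awk step at the end only pulls one count out of the `uniq -c` output, which is padded with leading spaces.

```shell
# Hypothetical pipe-delimited sample: year|team|league|player|salary
workdir=$(mktemp -d)
cat > "$workdir/salaries.txt" <<'EOF'
1999|BAL|AL|player01|500000
2000|BAL|AL|player01|750000
2000|BAL|AL|player02|1200000
2000|NYA|AL|player03|900000
2001|ATL|NL|player04|300000
EOF

# sort first so that equal values end up next to each other,
# then uniq collapses each run of duplicates into one line
distinct_teams=$(cut -d '|' -f 2 "$workdir/salaries.txt" | sort | uniq | head -n 10)
echo "$distinct_teams"

# uniq -c prefixes each distinct value with its number of occurrences
cut -d '|' -f 2 "$workdir/salaries.txt" | sort | uniq -c

bal_count=$(cut -d '|' -f 2 "$workdir/salaries.txt" | sort | uniq -c | awk '$2 == "BAL" { print $1 }')
rm -rf "$workdir"
```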

Some Other Utility Commands for Other Operations

Here are some other command line tools that you could use. I am not going into the specifics, as the specifics are pretty hard.

1. Change delimiter in a file: Find and replace magic. You may want to replace certain characters in the file with something else using the tr command.
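For instance, this sketch swaps every comma for a pipe character. The file name data.txt and its single row are made up for illustration.

```shell
# Hypothetical comma-delimited file
workdir=$(mktemp -d)
printf '2000,BAL,750000\n' > "$workdir/data.txt"

# tr reads stdin and replaces each "," with "|"
converted=$(cat "$workdir/data.txt" | tr ',' '|')
echo "$converted"
rm -rf "$workdir"
```

Note that tr works character by character, so it also replaces commas that appear inside quoted fields.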

2. Sum of a column in a file: Using the awk command, you can find the sum of a column in a file.

Divide it by the number of lines, and you can get the mean.
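A minimal sketch of both calculations, assuming a pipe-delimited file with no header row and the salary in column 5. The file name and rows are made up for the example.

```shell
# Hypothetical pipe-delimited sample: year|team|league|player|salary
workdir=$(mktemp -d)
cat > "$workdir/salaries.txt" <<'EOF'
1999|BAL|AL|player01|500000
2000|BAL|AL|player01|750000
2000|BAL|AL|player02|1200000
2000|NYA|AL|player03|900000
2001|ATL|NL|player04|300000
EOF

# -F '|' sets the field separator; $5 is the fifth column
total=$(awk -F '|' '{ sum += $5 } END { print sum }' "$workdir/salaries.txt")

# NR is the number of lines read, so sum / NR is the mean
mean=$(awk -F '|' '{ sum += $5 } END { if (NR > 0) print sum / NR }' "$workdir/salaries.txt")

echo "total=$total mean=$mean"
rm -rf "$workdir"
```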

awk is a powerful command which is a whole language in itself.

Do see the wiki page for awk for a lot of good use cases of awk.

I also wrote a post on awk as the second part of this series.

Check it here.

3. Find the files in a directory that satisfy a specific condition: You can do this by using the find command.

Let’s say you want to find all the .txt files in the current working directory that start with A.

To find all .txt files starting with A or B, we could use a pattern with a character class.
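Both searches could look roughly like this. The file names are made up for illustration. Keep in mind that -name takes a shell-style pattern rather than a full regular expression, and that find also descends into subdirectories.

```shell
# Hypothetical directory with a few empty files
workdir=$(mktemp -d)
touch "$workdir/Apple.txt" "$workdir/Banana.txt" "$workdir/Cherry.txt" "$workdir/Avocado.csv"

# Quote the pattern so the shell passes it to find unexpanded
a_files=$(cd "$workdir" && find . -name 'A*.txt')
echo "$a_files"

# [AB] matches a single character that is either A or B
ab_files=$(cd "$workdir" && find . -name '[AB]*.txt' | sort)
echo "$ab_files"
rm -rf "$workdir"
```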

Other Cool Tricks:

Sometimes you want the output of a command line utility (shell commands or Python scripts) to be stored in a text file rather than shown on stdout.

You can use the “>” operator for that.

For example, you could have stored the output after replacing the delimiters in the previous example in another file called newdata.txt as follows:

cat data.txt | tr ',' '|' > newdata.txt

I got confused between “|” (piping) and “>” (to_file) operations a lot in the beginning.

One way to remember is that you should only use “>” when you want to write something to a file.

“|” cannot be used to write to a file.

Another operation you should know about is the “>>” operation.

It is analogous to “>”, but it appends to an existing file rather than overwriting it.
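The difference between the two can be sketched like this. The file names and contents are made up for illustration.

```shell
workdir=$(mktemp -d)
printf '2000,BAL,750000\n' > "$workdir/data.txt"

# ">" creates newdata.txt, or replaces its contents if it already exists
cat "$workdir/data.txt" | tr ',' '|' > "$workdir/newdata.txt"

# ">>" adds to the end of the file and keeps what was already there
printf '2001,ATL,300000\n' | tr ',' '|' >> "$workdir/newdata.txt"
appended=$(cat "$workdir/newdata.txt")
echo "$appended"

# Using ">" again throws away the two lines written above
printf 'fresh start\n' > "$workdir/newdata.txt"
overwritten=$(cat "$workdir/newdata.txt")
echo "$overwritten"
rm -rf "$workdir"
```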

If you would like to know more about the command line, which I guess you would, there is The Unix Workbench course on Coursera, which you can try out.

So, this is just the tip of the iceberg.

Although I am not an expert in shell usage, these commands reduced my workload to a large extent.

If there are some shell commands you use regularly, or some shell commands that you find cool, do tell me in the comments.

I would love to include them in the blog post.

I wrote a blog post on awk as the second part of this post.

Check it here.

Originally published at mlwhiz.com.
