How to Filter Rows of a Pandas DataFrame by Column Value

How to Filter Rows of a Pandas DataFrame by Column ValueTwo simple ways to filter rowsStephen FordhamBlockedUnblockFollowFollowingApr 19Quite often it is a requirement to filter tabular data based on a column value.

We may be presented with a Table, and want to perform custom filtering operations.

Fortunately, we can ultilise Pandas for this operation.

Pandas is an open source Python library for data analysis.

It gives Python the ability to work with spreadsheet-like data enabling fast file loading and manipulation among other functions.

In order to achieve these features Pandas introduces two data types to Python: the Series and DataFrame.

This tutorial will focus on two easy ways to filter a Dataframe by column value.

The following example is the result of a BLAST search.

In bioinformatics, BLAST (basic local alignment search tool) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences.

This table has an ID for the query sequence expressed as a number and the species to which it matches to, among other metadata.

This example is completely arbitrary, the following techniques can be applied in a vast array of other situations.

To begin, we import the pandas library, and give it the alias pdI then use the pandas read_csv() function to read in my DataFrame.

I saved my DataFrame with the file extension .

csv (a comma separated value file).

Following this, I perform some basic operations to familiarize myself with the DataFrame.

Shown below, are the three operations I commonly use.

These include: printing the first 5 rows using the head method, and accessing the column names using the column attribute of the DataFrame object.

Finally, I look at the shape attribute of the DataFrame object.

The output from these three print functions are shown below (for enhanced presentation, I recommend running the same code with the Jupyter Notebook, this will display pandas DataFrame objects as a more browser-friendly HTML table).

The shape attribute returns a tuple, which gives the number of rows on the left hand side on the comma, and the number of columns on the right hand side.

This DataFrame has 29 rows and 5 columns.

Note, however that only the first 5 rows are displayed.

Say we want to filter this DataFrame, so that any entries in the Species_name_blast_hit column that are ‘bacterium’ are excluded.

How would we go about doing this?I will walk through 2 ways of selective filtering of tabular data.

To begin, I create a Python list of Booleans.

I then write a for loop which iterates over the Pandas Series (a Series is a single column of the DataFrame).

The Pandas Series, Species_name_blast_hit is an iterable object, just like a list.

I then use a basic regex expression in a conditional statement, and append either True if ‘bacterium’ was not in the Series value, or False if ‘bacterium’ was present.

When I print the first 5 entries of my Boolean lists, all the results are True.

This is exactly as expected, we can see from the Table earlier, that the first 5 entries in Species_name_blast_hit do not have bacteria in their name.

Here, I have created a list of Booleans which are the same length as the DataFrame, 29.

Following this, I convert the Boolean list into a Pandas Series and assigned it the variable name, Filtered.

You can create a Pandas Series by passing in a list to the pd.

Series() function.

Next, I use Boolean subsetting/indexing on my original Pandas DataFrame, Blast using square brackets notation and assign the new DataFrame the variable name New_blast_df.

Here, I write the original DataFrame, Blast, followed by square brackets with the Pandas Series, Filtered inside.

This series indicates which rows to select, because it is composed of True and False Values that correspond to rows in the Blast data-set.

As is clearly demonstrated from the Jupyter Notebook output, all ‘bacterium’ entries have been omitted!We can further confirm this, by checking the number of rows removed from the original DataFrame, by subtracting the first index of the tuple of Blast returned from the shape attribute from the first index of New_blast_df returned from the shape attribute.

Now, we can see that 8 rows were removed.

Fortunately, there is an additional way to subset your DataFrame by column value which is much simpler.

For this example, let’s assume we only want to include rows if they contain the species name 'Phoxinus’.

This way uses the contains method.

The contains method returns a Boolean array if each string contains the pattern.

To use it, you need to enter the name of your DataFrame, then use dot notation to select the appropriate column name of interest, followed by .

str and finally contains().

The contains method can also find partial name entries and therefore is incredibly flexible.

By default .

str.

contains is case sensitive.

To disregard case sensitivity, simply set the keyword argument case to False, .

str.

contains(case=False).

This adds further flexibility, but must be used with caution.

ConclusionThis tutorial has demonstrated 2 ways to filter tabular data based on column values.

Whilst the examples presented here make use of a single condition, multiple conditions can easily be applied to further fine tune the filtered output.

.

. More details

Leave a Reply