Replacing Excel with Python

Ankit Gandhi, Mar 4

After spending almost a decade with my first love, Excel, it's time to move on and search for a better half: one who is with me through thick and thin of my daily tasks, who is much better and faster, and who can give me a cutting edge in these challenging technological times, where technology is ditched for something newer at a very rapid pace.

The idea is to replicate almost all Excel functionality in Python, be it a simple filter or a complex task of creating an array of data from the rows and crunching it to get fancy results. The approach followed here is to start from simple tasks and move to complex computational tasks.

I encourage you to replicate the steps yourself for a better understanding.

The inspiration to create something like this came from the lack of a free tutorial that truly covers it all.

I read and follow the Python documentation heavily, and you will find a lot of inspiration from that site.

GitHub repo link: https://github.com/ank0409/Ditching-Excel-for-Python

Importing Excel Files into a Pandas DataFrame

The initial step is to import Excel files into a DataFrame so we can perform all our tasks on it.

I will be demonstrating the read_excel method of Pandas, which supports the xls and xlsx file extensions.

Using read_csv is the same as using read_excel; we won't go in depth, but I will share an example.

Though the read_excel method includes a million arguments, I will familiarise you with the most common ones that come in very handy in day-to-day operations.

I'll be using the Iris sample dataset, which is freely available online for educational purposes.

Please follow the link below to download the dataset and make sure to save it in the same folder where you are saving your Python file: https://archive.ics.uci.edu/ml/datasets/iris

The first step is to import the necessary libraries in Python. We can import the spreadsheet data into Python using the following code:

pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, parse_cols=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skip_footer=0, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)

Since there's a plethora of arguments available, let's look at the most used ones.
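As a minimal sketch, here is how a basic import might look; the file name iris.xlsx is an assumption, so adjust it to wherever you saved the dataset:

```python
# A minimal sketch, assuming the Iris data is saved as "iris.xlsx"
# in the same folder as this script (the file name is an assumption).
import pandas as pd

df = pd.read_excel("iris.xlsx")   # first sheet, first row used as the header

# read_csv works the same way if you downloaded the raw UCI file instead
# df = pd.read_csv("iris.data", header=None)

print(df.head())
```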

Important Pandas read_excel options

If we are using the path to a local file, on Windows it is by default separated by "\"; however, Python accepts "/", so make sure to change the slashes or simply keep the file in the same folder as your Python file.

Should you require a detailed explanation of the above, refer to the Medium article below: https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f

We can use Python to scan the files in a directory and pick out the ones we want.
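For instance, a small sketch using pathlib (the folder and glob pattern are assumptions):

```python
# List the Excel files in the current directory (pattern is an assumption)
from pathlib import Path

excel_files = sorted(Path(".").glob("*.xlsx"))
for f in excel_files:
    print(f.name)
```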

Import a specific sheet

By default, the first sheet in the file is imported into the DataFrame as is. Using the sheet_name argument we can explicitly mention the sheet that we want to import. The default value is 0, i.e. the first sheet in the file.

We can either mention the name of the sheet(s) or pass an integer value referring to the index of the sheet, as in the sketch below.
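A minimal sketch, assuming the workbook has a sheet named Sheet1 (the sheet name is an assumption):

```python
import pandas as pd

df_by_pos  = pd.read_excel("iris.xlsx", sheet_name=0)         # by position
df_by_name = pd.read_excel("iris.xlsx", sheet_name="Sheet1")  # by name
dfs = pd.read_excel("iris.xlsx", sheet_name=["Sheet1"])       # a list returns a dict of DataFrames
```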

Using a column from the sheet as an Index

Unless explicitly mentioned, an index column is added to the DataFrame, which by default starts from 0. Using the index_col argument we can manipulate the index column in our DataFrame; if we change its value from None to 0, it will use the first column as our index.
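A quick sketch of the index_col behaviour, with the same assumed file name:

```python
import pandas as pd

# Use the first column of the sheet as the DataFrame index instead of 0, 1, 2, ...
df = pd.read_excel("iris.xlsx", index_col=0)
print(df.index[:5])
```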

Skip rows and columns

The default read_excel parameters assume that the first row is a list of column names, which is incorporated automatically as column labels within the DataFrame. Using arguments like skiprows and header, we can manipulate the behaviour of the imported DataFrame.
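A hedged sketch; the row offsets and column labels below are assumptions about how your sheet is laid out:

```python
import pandas as pd

# Skip the first two rows of the sheet and treat the next row as the header
df = pd.read_excel("iris.xlsx", skiprows=2, header=0)

# Or declare that there is no header row and supply our own column labels
df_no_header = pd.read_excel("iris.xlsx", header=None,
                             names=["SepalLength", "SepalWidth",
                                    "PetalLength", "PetalWidth", "Name"])
```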

Import specific column(s)

Using the usecols argument we can specify that only certain columns should be imported into our DataFrame. This is not the end of the features available, but it is a start, and you can play around with them as per your requirements.
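A small sketch, with assumed column labels:

```python
import pandas as pd

# Import only the columns we care about
df = pd.read_excel("iris.xlsx", usecols=["SepalLength", "Name"])
# usecols also accepts positions, e.g. usecols=[0, 4], or an Excel-style range like "A:C"
```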

Let's have a look at the data from 10,000 feet

Now that we have our DataFrame, let's look at the data from multiple angles just to get a hang of it. Pandas has plenty of functions available; we'll use some of them to have a glimpse of our dataset.

“Head” to “Tail”

Use head and tail to view the first or last rows. The default is five rows, however the argument allows us to ask for a specific number. We can also view a specific column's data, get the names of all columns, use the info method for a summary of the DataFrame, use the shape method for the dimensions of the DataFrame, and look at the datatypes in the DataFrame, as sketched below.
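A minimal sketch of those quick-look calls (the Name column label is an assumption):

```python
df.head()      # first five rows (pass a number, e.g. df.head(10))
df.tail(3)     # last three rows
df["Name"]     # a specific column's data
df.columns     # names of all columns
df.info()      # summary of the DataFrame
df.shape       # dimensions as (rows, columns)
df.dtypes      # datatype of each column
```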

Slice and Dice, i.e. Excel filters

Descriptive reporting is all about data subsets and aggregations; the moment we start to understand our data a little, we begin using filters to look at smaller sets of data, or to view a particular column, to get a better understanding.

Python offers a lot of different methods to slice and dice DataFrames; we'll play around with a couple of them to understand how they work.

View a specific column

There are three main methods to select columns: use dot notation, e.g. data.column_name; use square brackets and the name of the column, e.g. data['column_name']; or use numeric indexing with the iloc selector, e.g. data.iloc[:, column_number].
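A minimal sketch with assumed Iris column names:

```python
# Three ways to view a specific column (column names are assumptions)
col = df.SepalLength        # dot notation
col = df["SepalLength"]     # square brackets and the column name
col = df.iloc[:, 0]         # numeric indexing with the iloc selector
```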

View multiple columns

View specific rows' data

The method used here is slicing using the loc function, where we can specify the start and end rows separated by a colon. Remember, the index starts from 0 and not 1. We can also slice rows and columns together.

Filter data in a column

We can filter a single value, filter multiple values, filter multiple values using a list, or filter values NOT in a list (the "not equal to" filter in Excel). We can also filter using multiple conditions in multiple columns; the input should always be a list, and we can use this method to replicate the advanced filter function in Excel. Filtering using numeric conditions replicates the custom filter in Excel, two filters can be combined to get the result, and there is an equivalent of the Contains function in Excel. A sketch of these follows.
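A minimal sketch of those selections and filters, assuming the Iris column names and species labels:

```python
multi  = df[["SepalLength", "SepalWidth"]]            # view multiple columns
rows   = df.loc[0:4]                                  # specific rows (start:end, inclusive with loc)
block  = df.loc[0:4, ["SepalLength", "Name"]]         # slice rows and columns together

setosa  = df[df["Name"] == "Iris-setosa"]                              # filter one value
in_list = df[df["Name"].isin(["Iris-setosa", "Iris-virginica"])]       # filter using a list
not_in  = df[~df["Name"].isin(["Iris-setosa", "Iris-virginica"])]      # NOT in list

# Multiple conditions in multiple columns / numeric conditions / combined filters
combo = df[(df["Name"] == "Iris-setosa") & (df["SepalLength"] > 5)]

contains = df[df["Name"].str.contains("setosa")]      # Excel's Contains
```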

Get the unique values from the DataFrame

If we want to view the entire DataFrame with only the unique values, we can use the drop_duplicates method.

Sort Values

Sort data by a certain column; by default the sorting is ascending.

Statistical summary of data

The DataFrame describe method generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. It can also give summary stats of the character columns.

Data Aggregation

Counting the unique values of a particular column: the resulting output is a Series, and you can refer to it as a single-column Pivot Table.

Count cells

Count non-NA cells for each column or row.
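A short sketch of those summaries, again with assumed column names:

```python
df["Name"].unique()            # unique values of a column
df.drop_duplicates()           # the DataFrame with duplicate rows removed
df.sort_values("SepalLength", ascending=False)   # sort by a column (ascending by default)
df.describe()                  # central tendency, dispersion, shape (NaN excluded)
df.describe(include="object")  # summary stats of character columns
df["Name"].value_counts()      # count of each unique value, a single-column "pivot"
df.count()                     # non-NA cells per column (use axis=1 for per row)
```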

Sum

Summarising the data gives us a snapshot by either rows or columns. We can replicate the method of adding a total column against each row, i.e. add a total column to the existing dataset. To sum specific columns, use the loc method and pass the column names, or use the alternative below; and if you don't like the new column, delete it using the drop method.

Adding a sum-total beneath each column takes a few steps. A lot happens in the sketch below; the approach we are using is: Sum_Total: do the sum of the columns; T_Sum: convert the Series output to a DataFrame and transpose; re-index to add the missing columns; Row_Total: append T_Sum to the existing DataFrame.
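A minimal sketch, assuming the numeric Iris columns listed below:

```python
import pandas as pd

num_cols = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]  # assumed labels

df["Total"] = df[num_cols].sum(axis=1)         # total column against each row
df["Total"] = df.loc[:, num_cols].sum(axis=1)  # same thing via loc with column names
df = df.drop(columns=["Total"])                # don't like the new column? drop it

# Sum-total beneath each column
Sum_Total = df[num_cols].sum()                 # column sums as a Series
T_Sum = pd.DataFrame(Sum_Total).T              # convert to DataFrame and transpose
T_Sum = T_Sum.reindex(columns=df.columns)      # re-index to add the missing columns
Row_Total = pd.concat([df, T_Sum], ignore_index=True)   # append beneath the data
```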

Sum based on criteria, i.e. Sumif in Excel; Sumifs; Averageif; Max; Min; and Groupby, i.e. Subtotals in Excel, all work along similar lines and are sketched together below.
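A hedged sketch of those aggregations (column names and species labels are assumptions):

```python
# Sumif: sum one column where a condition on another holds
df.loc[df["Name"] == "Iris-setosa", "SepalLength"].sum()

# Sumifs: multiple criteria
df.loc[(df["Name"] == "Iris-setosa") & (df["SepalWidth"] > 3), "SepalLength"].sum()

# Averageif, Max, Min
df.loc[df["Name"] == "Iris-setosa", "SepalLength"].mean()
df["SepalLength"].max()
df["SepalLength"].min()

# Groupby, i.e. subtotals: one row of aggregates per group
df.groupby("Name").sum()
df.groupby("Name")["SepalLength"].agg(["sum", "mean", "max", "min"])
```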

Pivot Tables in DataFrames, i.e. Pivot Tables in Excel

Who doesn't love a Pivot Table in Excel? It's one of the best ways to analyse your data: it gives a quick overview of the information, helps you slice and dice the data with a super easy interface, helps you plot graphs based on the data, add calculated columns, and so on.

No, we won't have an interface to work with; we'll have to explicitly write the code to get the output. And no, it won't generate charts for you. But I don't think we can complete a tutorial without learning about Pivot Tables.

Here is a simple Pivot Table showing us the sum of SepalWidth in Values, SepalLength in the Rows and Name in the Column Labels.
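A minimal sketch of that table:

```python
import pandas as pd

simple = pd.pivot_table(df, values="SepalWidth", index="SepalLength",
                        columns="Name", aggfunc="sum")
```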

Let's see if we can complicate it a bit: blanks can be replaced with 0's by using the fill_value argument, we can have individual calculations on values using the dictionary method, we can have multiple calculations on the values, and if we use the margins argument we can have a total row added, as in the sketch below.

Vlookup

What a magical formula Vlookup is in Excel; I think it's the first thing everyone wants to learn, before even learning how to add.
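A hedged sketch of those variations; the PetalWidth column label is an assumption:

```python
import pandas as pd

fancy = pd.pivot_table(df, values=["SepalWidth", "PetalWidth"],
                       index="SepalLength", columns="Name",
                       aggfunc={"SepalWidth": "sum", "PetalWidth": "mean"},  # per-value calculations
                       fill_value=0)                                         # blanks become 0

totals = pd.pivot_table(df, values="SepalWidth", index="Name",
                        aggfunc=["sum", "mean"],   # multiple calculations on one value
                        margins=True)              # adds an "All" total row
```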

It looks fascinating when someone applies a Vlookup, and like magic when we get the output.

Makes life easy.

I can say with a lot of confidence that it's the backbone of every data-wrangling action performed on a spreadsheet.

Unfortunately, we don't have a Vlookup function in Pandas! Since we don't have one, merge is used as an alternative, and it works the same way as in SQL.

There are a total of four merge options available:

'left': Use the shared column from the left DataFrame and match to the right DataFrame; fill in any N/A as NaN.
'right': Use the shared column from the right DataFrame and match to the left DataFrame; fill in any N/A as NaN.
'inner': Only show data where the two shared columns overlap. This is the default method.
'outer': Return all records when there is a match in either the left or right DataFrame.
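A minimal sketch of the Vlookup-via-merge idea, using a hypothetical lookup table keyed on the shared Name column:

```python
import pandas as pd

# Hypothetical lookup table for illustration only
lookup = pd.DataFrame({"Name": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
                       "Label": ["A", "B", "C"]})

# "Vlookup": bring the Label column into df by matching on Name
result = df.merge(lookup, on="Name", how="left")   # how = 'left', 'right', 'inner' or 'outer'
```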

The above might not be the best example to support the concept; however, the working is the same.

As they say, “No perfect tutorial exists”, nor does mine :).
