Interpreting Data through Visualization with Python Matplotlib

What Did the IBM Data Visualization Course Teach Me?

Saptashwa, Feb 9

Matplotlib, even though it is aging, remains one of the most vital tools for data visualization, and this post is about using matplotlib effectively to gain knowledge from a data-set.

The data-visualization course is one of the 9 courses in the IBM Data Science Professional Certificate program, which I started around a month ago, and this post is a run-through of some powerful techniques that I learnt in the course to elucidate data better.

Some useful plotting techniques and new inferences from the data that are not covered in the course itself are also shown here.

You will find the code in detail on my GitHub, where I have shared a Jupyter notebook.

Rather than the code, I will focus more on the plots and at times share snippets of code.

The data-set deals with immigration to Canada from various countries over the years 1980 to 2013.

Canada Immigration Data in .xlsx format.

As the data-set is in .xlsx format and there are 3 sheets in the file, below is a portion of the notebook that shows how to read this file into a data-frame.

skiprows takes care of the initial useless rows in the Excel sheet.
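A minimal sketch of the reading step, assuming pandas is installed. The function name, and whatever path, sheet name and number of skipped rows are passed to it, are placeholders and not values taken from the notebook. Since the spreadsheet is not bundled with this post, the effect of skiprows is demonstrated on a small in-memory CSV with made-up content; read_csv accepts the same argument.

```python
import io

import pandas as pd


def read_immigration_sheet(path, sheet_name, skiprows):
    """Read one sheet of the workbook, skipping the leading non-data rows.

    Hypothetical helper: the caller supplies the real path, sheet and row count.
    """
    return pd.read_excel(path, sheet_name=sheet_name, skiprows=skiprows)


# Made-up text: two junk rows, then the header row, then one data row.
raw = (
    "Some title row\n"
    "Some note row\n"
    "OdName,AreaName,1980\n"
    "Haiti,Latin America and the Caribbean,100\n"
)

# skiprows=2 drops the first two lines, so the third line becomes the header.
demo = pd.read_csv(io.StringIO(raw), skiprows=2)
print(demo.columns.tolist())
```

With the real workbook, the call would be along the lines of read_immigration_sheet(path, sheet_name, skiprows), with values chosen after looking at the sheet.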

It is better to rename the columns ‘OdName’ and ‘AreaName’ to ‘Country_Name’ and ‘Continents’ respectively (using pandas.DataFrame.rename) for better understanding, and inplace=True makes sure that the changes are saved in the data-frame.

If you don’t want to make this change in the original data-frame you should use inplace=False, which is the default and returns a renamed copy instead.
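The difference between the two settings can be sketched on a toy data-frame. The column names come from the post; the rows and counts are made up for illustration.

```python
import pandas as pd

# Toy frame with the original column names; the counts are made up.
df = pd.DataFrame({
    "OdName": ["Haiti", "Iceland"],
    "AreaName": ["Latin America and the Caribbean", "Europe"],
    "2010": [100, 10],
})
new_names = {"OdName": "Country_Name", "AreaName": "Continents"}

# inplace=False (the default) returns a renamed copy and leaves df untouched.
renamed = df.rename(columns=new_names)
print(df.columns.tolist())
print(renamed.columns.tolist())

# inplace=True modifies df itself and returns None.
result = df.rename(columns=new_names, inplace=True)
print(result, df.columns.tolist())
```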

Bar Plots: DataFrame.plot(kind=’bar’)

Once we are all set after a few more tweaks, let’s plot using pandas DataFrame.plot, and first we will try some bar plots.

If we plot the number of immigrants from Haiti over the years then we can see a surprising rise in the number of immigrants around 2009, 2010 and 2011.

Figure 1: Rising number of Haitian Immigrants near 2011.

Some of you have already guessed it right: due to the catastrophic Haiti earthquake in 2010, the number of immigrants increased sharply in 2011.

Let’s make a clearer representation with an awesome annotation.

Figure 2: Cool annotation added to Figure 1 to have a more purposeful plot.
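An annotated bar plot of this kind could be sketched as follows. The counts are made up for illustration and are not the real Haiti figures, and the styling is my own choice rather than the notebook's.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(2005, 2014))
# Illustrative numbers only, NOT the real immigration counts.
haiti = pd.Series(
    [1600, 1600, 1500, 2400, 2000, 4700, 6500, 5800, 4100],
    index=years, name="Haiti",
)

fig, ax = plt.subplots(figsize=(8, 5))
haiti.plot(kind="bar", ax=ax, color="steelblue")
ax.set_xlabel("Year")
ax.set_ylabel("Number of immigrants")
ax.set_title("Immigrants from Haiti to Canada (toy data)")

# On a pandas bar plot the bars sit at x = 0, 1, 2, ..., not at the year values,
# so the arrow has to point at the position of the 2010 bar.
x_2010 = years.index(2010)
ax.annotate(
    "2010 earthquake",
    xy=(x_2010, haiti.loc[2010]),       # arrow tip: top of the 2010 bar
    xytext=(x_2010 - 4, haiti.max()),   # where the label text is placed
    arrowprops=dict(arrowstyle="->", color="red"),
)
fig.tight_layout()
```

In a notebook, the Agg line can be dropped and the figure shows inline.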

A very similar trend can be seen in the number of immigrants from Iceland, as the Icelandic financial crisis (2008–2011) led to a severe economic depression.

Figure 3: Economic crisis in Iceland caused a sharp increase in the number of immigrants to Canada.

We can go on to see such trends for separate countries using bar plots, but let’s explore another way to visualize data, using pie plots.

Pie Plots: DataFrame.plot(kind=’pie’)

A pie plot is a kind of circular graphic, and the slices in this circular plot represent numeric proportions.

Here we can see how the numeric proportions of immigrants from different continents varied over 20 years (between 1985 and 2005) using a pie plot.

However, effective representation is the issue.

Let’s see below the code and the corresponding plot.

Figure 4: Pie plots of immigrants from different continents in the years 1985 and 2005 are shown in the left and right panels respectively.

As you can see, these pie charts are not visually pleasing, and even though we get a rough idea about how the percentage of immigrants from different continents varied over a span of 20 years, they are still not very self-explanatory.

Using the right keywords, though, can make the pie charts a whole lot better.

Figure 5: Same as Figure 4, but this one is far more visually appealing than the previous one. In the right panel a shadow is added.

The code snippet used to plot the pie charts above is given below.

A few of the important keywords that I learnt are autopct, which controls how the percentages are formatted (here up to 2 decimal places; %1.3f would show float numbers up to 3 decimal places), and pctdistance, which fixes the distance of the text from the center of the circle.

For making a title spanning the sub-plots I have used matplotlib.pyplot.suptitle.
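A sketch of how these keywords fit together, assuming made-up continent counts (not the real data). The layout, legend placement and title text are my own choices and may differ from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts per continent, for illustration only.
continents = pd.DataFrame(
    {"1985": [4, 40, 30, 16, 9, 1], "2005": [11, 60, 14, 10, 4, 1]},
    index=["Africa", "Asia", "Europe",
           "Latin America and the Caribbean", "Northern America", "Oceania"],
)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, year, shadow in zip(axes, ["1985", "2005"], [False, True]):
    continents[year].plot(
        kind="pie", ax=ax,
        autopct="%1.2f%%",   # percentage text with 2 decimal places
        pctdistance=1.15,    # put that text just outside the circle
        labels=None,         # no slice labels; a legend is used instead
        startangle=90,
        shadow=shadow,       # shadow only in the right panel
    )
    ax.set_ylabel("")        # drop the column name pandas puts on the y-axis
    ax.set_title(year)
    ax.axis("equal")         # keep the pie circular

# The left panel has no shadow, so its patches are exactly the wedges.
axes[0].legend(list(axes[0].patches), list(continents.index), loc="upper left")
plt.suptitle("Immigration to Canada by continent (toy data)")
```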

From the plots above you can see that in 1985 a significant portion of immigrants were from Europe, whereas 20 years later, in 2005, immigration is completely dominated by Asia.

Actually, in the early days the immigrants were mostly from the British Isles; later on India and China took over that spot.

Bubble Plots

Bubble plots are basically glorified scatter plots where 3 dimensions of data can be displayed in a 2D plot.

Apart from the usual X and Y, the size of the bubble (or any other marker) represents another dimension (read: feature).

To see an example of such plots, I have selected immigration information of India and China over the years 1980 to 2013.

We see a jump in numbers around 1997–1998, which could possibly be attributed to the Asian financial crisis.

Let’s see it below.

Figure 6: Bubble plots of immigrants from China and India to Canada over the years 1980 to 2013.

If you notice the star markers (representing immigrants from India), they got bigger and the color changed from purple to blue over the years.

Let’s see the code snippet below to understand what exactly is represented by the size and color of the marker.

In plt.scatter, s and c represent the size and color of the markers.

Particularly for this plot, to represent Indian immigrants, I’ve made use of both these parameters.

Here s is the normalized number of immigrants over the years (multiplied by 2000 so that the marker sizes are big enough), and c, together with cmap, represents just the raw number of immigrants.

So a higher number of immigrants is represented by blue and a lower number by purple.
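The roles of s, c and cmap can be sketched with synthetic numbers. The post only says the values are "normalized", so min-max scaling is an assumption here, and the colormap name is a guess that roughly gives purple for low and blue for high values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(1980, 2014)
# Synthetic, steadily rising counts standing in for the India column.
india = np.linspace(8000, 33000, years.size)

# Assumption: min-max normalization to the range [0, 1].
norm_india = (india - india.min()) / (india.max() - india.min())

fig, ax = plt.subplots(figsize=(10, 6))
bubbles = ax.scatter(
    years, india,
    s=norm_india * 2000 + 10,  # marker size grows with the count; +10 keeps the smallest visible
    c=india,                   # raw counts drive the color...
    cmap="cool_r",             # ...through this colormap (assumed, not from the notebook)
    marker="*", alpha=0.6,
)
fig.colorbar(bubbles, ax=ax, label="Number of immigrants")
ax.set_xlabel("Year")
ax.set_ylabel("Number of immigrants")
```

So the same column is encoded twice: once as marker size (after scaling) and once as color (raw values).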

World Map: folium.Map()

One very gripping part of the course was using the Folium library, which helps to create several types of interactive Leaflet maps.

We will see, using the same immigration data-set, how well one can represent some crucial information on the world map.

First we get started by installing Folium.

    pip install folium

    import folium
    print(folium.__version__)  # check the version
    >> 0.7.0

The special kind of map we are interested in is called a Choropleth.

It’s a kind of thematic map where portions of the map are shaded/patterned in proportion to the statistical variable used.

Here I will plot the number of immigrants from all over the world in the year 1985 and in the year 2005.

So the pie plot we have seen before for continents can be used to complement this Choropleth map.

Now to create a Choropleth map we need a .json file containing the border coordinates of all countries; this file is provided by IBM, and you can get it from my GitHub.

With this I used the code snippet below to plot a choropleth map of immigrants from all over the world in 1985.

These maps are interactive, so you can zoom in or out, but here I just show a screenshot.

In the code above, key_on relates to the country names in the .json file, and for data the data-frame we are interested in is passed.

Figure 7: Immigration to Canada in the year 1985 from all over the world.

Following the same procedure, we create another Choropleth map representing immigrants to Canada from all over the world in 2005, and you can clearly see the difference.

Figure 8: Same as Figure 7 but now data from year 2005 is used to plot immigration to Canada.

One big drawback you can see in the plots above is that the color of the United Kingdom didn’t change from 1985 to 2005, even though we know from the data-frame that the number of immigrants was quite high in the 1980s.

The problem is in the data-frame, where the country name is ‘United Kingdom of Great Britain and Northern Ireland’, whereas in the .json file it is just ‘United Kingdom’.

So one can use the replace option on the ‘Country_Name’ column:

    Canada_df_world_map.Country_Name = Canada_df_world_map.Country_Name.replace({"United Kingdom of Great Britain and Northern Ireland": "United Kingdom"})
    Canada_df_world_map.tail(20)

It is a crude, hard-coded replacement (both names refer to the same country), but we can just plot to verify our understanding, so let’s check below.

Figure 9: Same as figure 7 but now the United Kingdom part is corrected.
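Name mismatches like this one can also be found before plotting, by comparing the two sets of names. This is a sketch on toy stand-ins: the helper name is hypothetical, the counts are made up, and the "name" property is an assumption about how the .json file stores country names.

```python
import pandas as pd

# Toy stand-ins for the .json file and the data-frame; counts are made up.
world_geo = {"features": [
    {"properties": {"name": "United Kingdom"}},
    {"properties": {"name": "India"}},
]}
df = pd.DataFrame({
    "Country_Name": ["United Kingdom of Great Britain and Northern Ireland", "India"],
    "1985": [100, 50],
})


def unmatched_names(frame, geo):
    """Names in the data-frame that have no counterpart in the GeoJSON."""
    geo_names = {feature["properties"]["name"] for feature in geo["features"]}
    return sorted(set(frame["Country_Name"]) - geo_names)


before = unmatched_names(df, world_geo)
df["Country_Name"] = df["Country_Name"].replace(
    {"United Kingdom of Great Britain and Northern Ireland": "United Kingdom"}
)
after = unmatched_names(df, world_geo)
print(before, after)
```

Any name left in the list after the replacement would show up on the map without data.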

Well, it is finally time to wrap up this post as it is getting long, but I hope you got a pretty decent idea about effective data visualization, i.e. how to tell stories with a beautiful presentation.

This post covers most of the fundamental techniques that I have learnt in the Data Visualization course offered by IBM on Coursera.

Just to review how good it is, I have to say that the labs, where one can directly play with all the techniques taught in lessons, are the most effective and fruitful part of the course.

At times the course material contains some printing mistakes, but they are slowly being addressed.

Learning to plot a waffle chart was great too but I leave that for students who took the course exclusively.

Overall, the experience was quite fun, especially for reviewing some of the fundamentals within a week. It is a great course!

Discover more with the complete Jupyter notebook on my GitHub, including the data-set used here.

Find me on LinkedIn, and sometimes I post cool pics on National Geographic.
