Visualizing the Scimago journal ranking database with Pandas

Saurav Jha, Feb 22

SCImago Journal Rank (SJR indicator) is a measure of scientific influence of scholarly journals that accounts for both the number of citations received by a journal and the importance or prestige of the journals where such citations come from.

The SCImago website maintains detailed information on journal and country rankings.

Unlike similar sites, it also offers visualization tools of its own: graphs of the structure of the various sciences, real-time maps, and bubble charts built on a range of indicators such as H Index, Total Documents, Citable Documents, Citations, Self-Citations, Citations per Document and Countries.

In this blog post, I use Scimago’s database to create similar visualizations and to analyze the curious trends they reveal.

The csv data file is available for download here.

The libraries used are pandas, seaborn, matplotlib, and plotly.

For the impatient souls, code with outputs can be found in this jupyter notebook.

Loading the database

I use pandas for loading the csv file.

The figure below shows various fields of the database:

In the figure above, ‘title’ refers to the name of the publishing entity.

These can be of types journal, trade journal, book series or conference and proceedings.

Also listed are the SCImago Journal Rank (SJR), the country hosting each entity, the publishing organisation, and the categories each belongs to, alongside other metrics.

The figure on the left shows the number of columns and the data types of the fields, obtained by calling the info() method on the data frame.

We can observe that the database is pretty clean with only a few columns (SJR, Country, Publisher) containing missing values.
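For readers following along without the figures, here is a minimal sketch of the loading and inspection step. The three-row excerpt, its column subset, and the ';' delimiter are my assumptions for illustration; check them against your own copy of the CSV.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for the downloaded file.
# The ';' delimiter and this column subset are assumptions.
sample_csv = io.StringIO(
    "Rank;Title;Type;SJR;H index;Country;Publisher\n"
    "1;Journal A;journal;12.5;300;United States;Publisher X\n"
    "2;Journal B;book series;;45;India;\n"
    "3;Journal C;conference and proceedings;0.3;12;;Publisher Z\n"
)

df = pd.read_csv(sample_csv, sep=";")

# Column names, dtypes and non-null counts, as in the figure described above.
df.info()

# Number of missing values per column, keeping only columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])
```

On this toy excerpt, only SJR, Country and Publisher show up as having missing values, mirroring the observation above.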

Visualizing the H-index

The H-index of a journal is the largest number h such that h of its articles have each received at least h citations.
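To make the definition concrete, here is a small sketch that computes an H-index from a list of per-article citation counts. The function name and the citation counts are made up for illustration; Scimago ships the H-index precomputed, so this is not part of the analysis itself.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` articles all have >= rank citations
        else:
            break      # counts only get smaller from here on
    return h


example = [25, 8, 5, 3, 3, 1]   # made-up citation counts
print(h_index(example))
```

Note that raising the most cited article in this list from 25 to 1000 citations would leave the result unchanged, since only the number of articles clearing the threshold matters.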

Elegant in its simplicity, the H-index can be thought of as the most important metric of publication performance today.

Let us begin by visualizing the distribution of H-index values, to see which values occur and how frequently.

We can use Seaborn’s distplot for this.

sns.distplot(df.loc[df['Country'] == 'United States']['H index'].dropna(), kde=True)
print(df.loc[df['Country'] == 'United States']['H index'].dropna())

The figures below show the distribution for the United States and India.

The blue curves over the plots denote Gaussian kernel density estimates.

We can easily infer a few sharp differences between the two.

For instance, a majority of the Indian journals have an H-index in the range 0–20, while those of the U.S. cover 0–100.

Also, none of the Indian journals exceeds an H-index of 200.
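The same comparison can be made numerically rather than by eye. The sketch below uses a handful of made-up rows (not values from the Scimago file) and assumes the column names 'Country' and 'H index' used in the snippet above.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Scimago file.
toy = pd.DataFrame({
    "Country": ["United States", "United States", "United States",
                "India", "India", "India"],
    "H index": [250, 90, 15, 18, 7, 40],
})

# Per-country count, median and maximum of the H index.
summary = toy.groupby("Country")["H index"].agg(["count", "median", "max"])
print(summary)

# Share of journals per country with an H index of at most 20.
share_low = (toy["H index"] <= 20).groupby(toy["Country"]).mean()
print(share_low)
```

Run on the full data frame, a table like this would let you check claims such as "most journals fall in the 0–20 range" without reading them off a plot.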

Moving further, I tried visualizing the total cites of each country (in the previous 3 years) against their H-indices using a bubble chart.

For this, we need to size the bubbles based on the total number of documents published by these countries over the last three years:

plt.figure(figsize=(10,8))
sns.set_context("notebook", font_scale=1.1)
sns.set_style("ticks")
sizes = [10, 60, 90, 130, 200]
marker_size = pd.cut(df['Total Docs. (3years)']/10, [0, 40, 80, 120, 160, 200], labels=sizes)
sns.lmplot(x='H index', y='Total Cites (3years)', data=df, hue='Country', fit_reg=False, scatter_kws={'s':marker_size})
plt.title('H index vs Total Cites (3 years)')
plt.xlabel('H index')
plt.ylabel('Total Cites')
plt.ylim((0, 50000))
plt.xlim((0, 600))
plt.savefig('sns.png')

Due to the image size limitations of LinkedIn, I could not attach the entire image and have instead provided the Google Drive link.

From the chart, we can easily distinguish the countries producing the most and the least cited documents.

As expected, the United States lies at the top, followed by the United Kingdom, the Netherlands, Austria, and Germany, while those at the bottom of the list are Zimbabwe, Vatican City, Fiji and Liechtenstein.
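The bubble sizing in the snippet above hinges on pd.cut, which maps each value to the label of the bin it falls into. Here is a small sketch of that step in isolation, using made-up document counts (already divided by 10, as in the post) and the same bin edges and labels.

```python
import pandas as pd

# Made-up document counts, already divided by 10 as in the post.
scaled_docs = pd.Series([5, 40, 41, 100, 160, 199, 250])

sizes = [10, 60, 90, 130, 200]
bins = [0, 40, 80, 120, 160, 200]

# Bins are right-closed by default: (0, 40], (40, 80], ..., (160, 200].
marker_size = pd.cut(scaled_docs, bins, labels=sizes)
print(marker_size.tolist())
```

The last value, 250, lies beyond the final bin edge and therefore gets no label (NaN). In other words, with these edges any entry with more than 2000 documents would end up without a marker size, which is worth keeping in mind when reading the chart.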

I also used plotly to obtain a similar yet more interactive chart.

However, it was too heavy to attach here, as plotly’s page crashed multiple times while I navigated through the interactive graph.

(Any suggestions on alternatives are welcome.)

hindex_vs_totalCites.png (drive.google.com)

Visualizing the correlation amongst various metrics

Next, we would like to learn the correlation patterns among the various metrics.

This is where Seaborn’s jointplot comes to the rescue.

Using jointplot, we can display the data points for two variables together with both of their distributions, kernel density estimates, and an optional regression line fitted to the data:

The graph (last sub-plot) shows that the number of citable documents is highly correlated with the total number of references, with a Pearson correlation coefficient (pearsonr) of 0.78.

Also, the regression line for this graph shows a strong upward tendency: as the total number of references grows, the number of citable documents also increases.

Both measures are weaker when evaluating the H-index with respect to total documents: pearsonr = 0.35, and a regression line with only a slight upward tendency.
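For anyone curious what the pearsonr value reported on the joint plots measures, here is a minimal sketch of the Pearson coefficient computed from its definition. The helper name and the two short lists are made up for illustration and are unrelated to the 0.78 and 0.35 figures above.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


refs = [120, 300, 450, 800, 1000]   # made-up reference counts
citable = [10, 22, 35, 50, 90]      # made-up citable-document counts
print(round(pearson_r(refs, citable), 3))
```

A value near 1 means the points hug an upward-sloping line, a value near 0 means there is little linear relationship, and a value near -1 means they hug a downward-sloping line.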

Further on correlation, we can use Seaborn’s heatmap to visualize the correlation among multiple metrics of the dataframe at once.

This can be achieved by calling the corr() method on a pandas DataFrame, which calculates correlation coefficients between all pairs of numeric columns using one of the specified methods (Pearson, by default).
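As a rough sketch of that step, the snippet below builds the correlation matrix for a tiny made-up data frame. The column names mirror the ones I assume are in the Scimago file, and the values are invented; selecting the numeric columns explicitly is my choice to keep the text column out of the calculation.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "H index": [10, 40, 25, 80],
    "Total Docs. (3years)": [100, 200, 300, 400],
    "Citable Docs. (3years)": [95, 190, 285, 380],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```

The resulting square matrix is what gets handed to the heatmap, for example via sns.heatmap(corr, annot=True).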

One particular trend to notice is that the number of citable documents is in nearly perfect correlation with the total number of documents published.

However, neither of these metrics has a strong impact on the H-index.

The metric most strongly related to a journal’s H-index thus appears to be the total number of cites the journal has received.

The trend in the above heat map, when supported by the observations from the joint plots, could suggest the intrinsic flaw in the calculation of the H-index, as stated in Ramana et al. (2013):

Despite a large number of your papers being heavily cited (total cites, in our case), it considers only those papers that are cited to a minimum number as your h-index and for your h-index to improve, citations of your other papers should also increase.

Visualizing the countries

We have been studying the relationships among pairs of various numeric metrics till now.

So far, that's all good.

But what if we want to know the trend in these metrics for the individual countries (or for some third feature)?

One solution to this is using pairplots.

Pairplots are useful for exploring correlations in multidimensional data, when we might want to plot all pairs of values against each other.

For my purpose, I chose the first 1000 rows of the CSV file and passed the field ‘Country’ as the hue in order to map it to different colors:

sns.pairplot(df[:1000], hue='Country', size=2.5)

We can now study the performance of the individual countries.

Here are some quick inferences:

The countries lying in the pink spectrum (Bulgaria, Canada and India) can be seen to have lower H-index and citable documents, those in the blue-green spectrum have moderate values, while the orange-red spectrum denotes the highest performing countries.

The graphs along the diagonal are bar plots showing the univariate distribution of the variable in that column.

The near-perfect correlation that we observed between the number of citable documents and the total number of documents published holds here too: most of the data points (i.e., countries) are concentrated along the diagonal line in the above graph (fifth row, last column).

This means that we can fit a smooth regression line along the diagonal that covers the majority of the data points.

Visualizing the categories

We might also like to know more about the different categories of sciences mentioned in the database.

However, the category cells hold multiple semicolon-separated values.

Therefore, I first replicated the rows based on the different ‘;’-separated values in the categories.

new_df = pd.DataFrame(df.Categories.str.split(';').tolist(), index=df.Country).stack()
new_df = new_df.reset_index()[[0, 'Country']] # categories variable is currently labeled 0
new_df.columns = ['Categories', 'Country'] # renaming categories
new_df = new_df.dropna()
new_df.info()
new_df.head()

Now, let us observe the categories with the most and the least number of publishing entities.

For this, we need to turn the string values of Categories into counts.

Pandas has a built-in method for this purpose: value_counts() returns the counts of unique values in a pandas Series.

fig, ax = plt.subplots()
new_df['Categories'].value_counts()[:20].plot(ax=ax, kind='bar')

As expected, computer science comes out on top!

Also, the aforementioned visualizations can be carried out with this new data frame as well.
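As an aside, the row replication and counting steps above can also be sketched with DataFrame.explode, which is a different route from the stack-based one I used. The toy rows below are made up, and stripping whitespace after the split is my assumption that values following a ';' may carry a leading space.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Scimago file.
toy = pd.DataFrame({
    "Country": ["United States", "India", "Germany"],
    "Categories": ["Software; Oncology", "Software", None],
})

long_df = (
    toy.assign(Categories=toy["Categories"].str.split(";"))  # string -> list of strings
       .explode("Categories")                                # one row per list element
       .dropna(subset=["Categories"])                        # drop rows with no category
       .reset_index(drop=True)
)
long_df["Categories"] = long_df["Categories"].str.strip()    # remove stray spaces

# Number of rows per category, most frequent first.
counts = long_df["Categories"].value_counts()
print(counts)
```

Here 'Software' appears twice and 'Oncology' once, and the row with a missing category drops out. The resulting counts can be plotted with .plot(kind='bar') exactly as before.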
