Uncover Hidden Gems with Exploratory Data Analysis

It might be interesting to see where Walmart’s goods were coming from in 2013.

The describe function is very useful for EDA:

xdata['COUNTRY OF ORIGIN'].describe().to_dict()

{'count': 174093, 'unique': 71, 'top': 'China', 'freq': 125366}

We see that the top country of origin is China, accounting for 72% of the shipments (by count).

However, there are 71 unique countries, so who else is selling to Walmart?

Visualization of Walmart’s Global Suppliers

We could look at a histogram, but it would have more visual impact to see it on a map.

To do this, we can make a choropleth map, where the shade of each country’s color shows its percentage of the shipments.

We will use Folium, which uses Python and the Leaflet.js library to create interactive maps for data visualization.

A Folium Choropleth map needs a dataframe and two columns: one used to index into the JSON mapping data, and one containing the value used to pick a shading color on the map.

I’m sure there are more efficient ways of doing this, but since this is exploratory data analysis and the data set is not that large, we can use groupby, describe, and xs to quickly make a dataframe with the columns needed for a Folium map.

If you are having difficulty seeing how I ended up with the expression below, break it into parts and display each part.

I don’t really care about the ‘WEIGHT (KG)’ column right now, just that we can use its ‘count’ field.
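If it helps, here is one way to break the expression apart (a sketch; the intermediate names are mine, not from the original notebook):

by_coo = xdata.groupby('COUNTRY OF ORIGIN')        # one group per country
summary = by_coo.describe()                        # multi-level columns: (original column, statistic)
weight_stats = summary.xs('WEIGHT (KG)', axis=1)   # keep just the 'WEIGHT (KG)' statistics
weight_stats.head()                                # count, mean, std, min, ... per country

Chained together, the full expression looks like this: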

countries_of_origin = xdata.groupby('COUNTRY OF ORIGIN').describe().xs('WEIGHT (KG)', axis=1).copy()
countries_of_origin.reset_index(level=0, inplace=True)
coo_record_count = sum(countries_of_origin['count'])
countries_of_origin['pct'] = countries_of_origin['count'].apply(lambda cnt: 100*cnt/coo_record_count)
countries_of_origin['logpct'] = countries_of_origin['count'].apply(lambda cnt: np.log(100*cnt/coo_record_count))
# We will use the sorting later
countries_of_origin.sort_values(by=['count'])

The first three entries in countries_of_origin, by count

Now that we have the data in a useful format, let’s make a map with Folium:

# Initialize the map:
m = folium.Map(location=[30, 0], zoom_start=2, no_wrap=True)

# geo_data is the map data
# Add the color layer for the choropleth:
_ = folium.Choropleth(geo_data=folium_world_data.as_posix(),
                      name='Choropleth',
                      data=countries_of_origin,
                      columns=['COUNTRY OF ORIGIN', 'pct'],
                      key_on='feature.properties.name',
                      fill_color='YlGn',
                      fill_opacity=0.7,
                      line_opacity=0.2,
                      nan_fill_color='white',
                      legend_name='Country of Origin (%)').add_to(m)

# Save to html
map_path = reports_path / 'walmart_folium_chloropleth_pct.html'
m.save(map_path.as_posix())
m

Countries of origin for Walmart shipments, by count

This isn’t very useful, because China dominates so thoroughly that everything else is close to zero.

The white areas are countries that are not shipping to Walmart at all (at least in this data).

The yellow areas ship to Walmart, but in single digit percentages.

There was a reason that I added two columns to countries_of_origin above.

Let’s plot the map again, but this time use the log of shipment percentage.

I will omit the code for brevity, but I just replaced the second element in the columns parameter with ‘logpct’.
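For completeness, the changed call would look something like this (a sketch mirroring the code above, with only the columns argument swapped; the legend text is my own):

_ = folium.Choropleth(geo_data=folium_world_data.as_posix(),
                      name='Choropleth',
                      data=countries_of_origin,
                      columns=['COUNTRY OF ORIGIN', 'logpct'],  # log of the shipment percentage
                      key_on='feature.properties.name',
                      fill_color='YlGn',
                      fill_opacity=0.7,
                      line_opacity=0.2,
                      nan_fill_color='white',
                      legend_name='Country of Origin log(%)').add_to(m)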

Countries of origin for Walmart shipments, by log(count)

This is more instructive, since it shows the relative positioning amongst the runners-up.

Looking at this map made me curious about the countries that were almost zero but not quite.

Visually, Saudi Arabia stands out as a curiosity to me.

xdata[xdata['COUNTRY OF ORIGIN']=='Saudi Arabia']['PRODUCT DETAILS'].describe().to_dict()

{'count': 3, 'unique': 3, 'top': '320 BAGS, 19,200 KG OF ETHIOPI A ARABICA COFFEE SIDAMO GRAD- 2 TOP QUALITY REF P2006A, PA CKAGE: JUTE BAGS OF 60 KGS EA CH DELIVERY TERMS: FOB DJIBOU TI NET SHIPPED WEIGHT FDA R EGISTRATION NO. 11826035706 S', 'freq': 1}

Aha, it seems to be coffee transshipped from Ethiopia.

A glance at the other two items shows granite slabs and personal effects, not the expected petroleum products.

One other thing of note is that the ‘PRODUCT DETAILS’ entry has some weird word break issues (which are not limited to this entry).

Looking at the tail end of the sorted dataframe, I see Korea, which I would imagine should have more than one shipment given how big they are in the electronics space.

countries_of_origin.sort_values(by=['count'], axis=0, ascending=False).tail(4)

Sure enough, South Korea is number three by count (which we see if we replace tail(4) with head(3) above), so the lone shipment listed under Korea is a data entry error.

See also China Taiwan and Taiwan.

Some feedback to the owners of this data would be to require standard country names when inputting data.
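In the meantime, a lightweight fix on the analysis side is to collapse the non-standard names before grouping (a sketch; the mapping below is illustrative, based only on the duplicates spotted above):

country_fixes = {
    'Korea': 'South Korea',      # assumed from context (the head(3) result above)
    'China Taiwan': 'Taiwan',
}
xdata['COUNTRY OF ORIGIN'] = xdata['COUNTRY OF ORIGIN'].replace(country_fixes)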

Before we move away from the mapping section, I wanted to point out one oddity of Folium that took me a bit of time to figure out.

To set the key_on parameter of the Choropleth function correctly, I looked at the Folium data file folium/examples/data/world-countries.json to find the names in the data structure, giving me key_on='feature.properties.name'.

Weighty Matters

Now that we have identified which countries ship the largest number of items, let’s look at the countries that ship the heaviest items.

The Pandas Grouping Cookbook was very helpful in figuring out the expressions below.

top_by_weight = countries_of_origin.sort_values(by=['max'], axis=0, ascending=False)[0:10]['COUNTRY OF ORIGIN'].values

tw3 = xdata.groupby(['COUNTRY OF ORIGIN']).apply(
    lambda subf: subf.sort_values('WEIGHT (KG)', ascending=False).head(3))

cols_of_interest = ['COUNTRY OF ORIGIN', 'WEIGHT (KG)', 'PRODUCT DETAILS', 'ARRIVAL DATE']

# https://www.wolframalpha.com/input/?i=3e%2B07+kg
# ≈ (0.7 to 1) × mass of a Handysize cargo ship
mass_handy_size = 3.0e+07  # kg

tw3.loc[(tw3['COUNTRY OF ORIGIN'].isin(top_by_weight))
        & (tw3['WEIGHT (KG)'] > mass_handy_size)][cols_of_interest]

Recall that countries_of_origin was constructed using groupby and 'WEIGHT (KG)', so max here refers to the weight.

I have to admit that I was quite surprised at the countries that showed up at the top, but even more surprised when I saw what those products were.

The shipment from Venezuela was the heaviest, by two orders of magnitude.

Because the dataframe display truncates the ‘PRODUCT DETAILS’ column, let’s make sure we are not talking about olive oil here…

list(xdata[xdata['WEIGHT (KG)']>1e+09]['PRODUCT DETAILS'])[0]

'5,550.023 METRIC TONS (49,849 BBLS) LIGHT VIRGIN NAPHTHA'

Petroleum naphtha is an intermediate hydrocarbon liquid stream derived from the refining of crude oil.

Looking through the product details, we see that the heaviest shipments are petroleum products (ALKYLATE, GASOLINE, NAPHTHA), building materials (ROCK and CEMENT), and GRANULAR UREA (!?).

Beats the piss out of me what they do with that…

The hidden gems here for me were the various unexpected countries that supplied petroleum products, and that one of the expected oil exporters (Kuwait) shipped UREA instead.

One other interesting factoid, courtesy of Wolfram Alpha, is that the NAPHTHA shipment from Venezuela weighs almost as much as the Great Pyramid of Giza.

Pack It In

Since we have the fields ‘WEIGHT (KG)’ and ‘CONTAINER COUNT’, we can figure out what percentage of shipments would fit in standard shipping containers.

Given that the ‘WEIGHT (KG)’ covers almost 10 orders of magnitude, and also includes some zero values, we will find it useful to add a column, ‘log_weight_kg’.

xdata['log_weight_kg'] = xdata['WEIGHT (KG)'].apply(lambda x: np.log(max(1, x)))

With xdata['WEIGHT (KG)'].describe(), we see that the mean weight of a shipment is 48815.86 Kg, which Wolfram Alpha helpfully says is about half of the cargo mass capacity of a Boeing 747–200F aircraft.

Plotting a histogram of the ‘log_weight_kg’ column gives us the figure below, where the red dashed line is the median.
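The plotting code is omitted here; a minimal sketch that reproduces such a figure, assuming matplotlib is imported as plt and the column above exists:

ax = xdata['log_weight_kg'].hist(bins=50)
ax.axvline(xdata['log_weight_kg'].median(), color='red', linestyle='--')  # red dashed median line
ax.set_xlabel('log(WEIGHT (KG))')
ax.set_ylabel('number of shipments')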

Next, let us look at shipments that could potentially be shipped in a standard 20 or 40 foot shipping container, by weight.

Here I am going to make the assumption that ‘WEIGHT (KG)’ is the net weight of the shipment.

Add a ‘kg_per_container’ column to the dataset, normalizing missing values so that we have at least one container for each shipment.
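The helper calc_kg_per_container is not shown in the article; a minimal sketch of what it might look like, given the description above (the 40 ft payload constant is my approximation, not necessarily the notebook’s exact value):

# Assumed: max payload of a standard 40 ft dry container is roughly 26,500 kg
NW_KG_DRY_CONTAINER_40FT = 26500

def calc_kg_per_container(row):
    """Weight per container, counting a missing or zero 'CONTAINER COUNT' as one container."""
    count = row['CONTAINER COUNT']
    if pd.isna(count) or count < 1:
        count = 1
    return row['WEIGHT (KG)'] / count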

xdata['kg_per_container'] = xdata.apply(calc_kg_per_container, axis=1)
containerizable = xdata[xdata['kg_per_container'] <= NW_KG_DRY_CONTAINER_40FT]

We consider it ‘containerizable’ if it meets the weight requirement for a 40 ft container, and a simple calculation shows that this is true for 99.6% of the shipments.
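The check itself is a one-liner (a sketch):

100 * len(containerizable) / len(xdata)   # ≈ 99.6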

By narrowing our attention to this vast majority of shipments, setting aside the bulky outliers, we can get a more fine-grained picture of these smaller shipments.

The weight per container declines fairly steadily, apart from a hill at around 19000 Kg/container.

This might be an interesting area for further research.

Another Dimension

To highlight the importance of creating a data dictionary for each dataset, we now consider the ‘M.UNIT’ column. I am assuming that ‘M.UNIT’ gives the units for the ‘MEASUREMENT’ column (if so, this relationship should be recorded in the data dictionary).

xdata['M.UNIT'].value_counts().to_dict()

{'CM': 93812, 'X': 47585, 'CF': 321, 'M': 128, 'F': 24, 'CC': 11, 'MM': 4, 'SS': 3, 'XX': 3, 'FF': 1}

We can take a guess that ‘CM’ means cubic meters and ‘CF’ is cubic feet, but what about the rest? To harp on my original point, we should not have to guess; these should be given in the data dictionary.

Since it appears we must guess, let us start out with the less common measurements.

low_units = ['F', 'CC', 'MM', 'SS', 'XX', 'FF']
cols_of_interest = ['COUNTRY OF ORIGIN', 'WEIGHT (KG)', 'QUANTITY', 'CONTAINER COUNT',
                    'MEASUREMENT', 'M.UNIT', 'PRODUCT DETAILS']  # , 'ARRIVAL DATE'
pd.set_option('display.max_colwidth', 125)
pd.set_option('display.max_rows', 50)
xdata[xdata['M.UNIT'].isin(low_units)][cols_of_interest]

The first few entries of the less common M.UNIT collection

For the six less common ‘M.UNIT’ values in low_units, the only pattern I could readily discern is that if ‘M.UNIT’ is ‘F’, ‘FF’, or ‘MM’, then ‘MEASUREMENT’ is zero, so perhaps these are ignorable.

The others will have to remain a mystery until we can talk with the data owners.
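One quick way to confirm the zero-‘MEASUREMENT’ pattern (a sketch):

xdata[xdata['M.UNIT'].isin(['F', 'FF', 'MM'])]['MEASUREMENT'].max()   # 0.0 if the pattern holds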

We can get a quick overview of the distribution over all the units using seaborn to make a stripplot.

We limit ourselves to a subset of the data that might fit in a 40 ft container.

xic = xdata[(xdata['CONTAINER COUNT'] > 0) & (xdata['WEIGHT (KG)'] > 0) &
            (xdata['kg_per_container'] <= NW_KG_DRY_CONTAINER_40FT)]
g = sns.stripplot(x='MEASUREMENT', y='M.UNIT', data=xic)
sns.despine()  # remove the top and right line in graph
g.figure.set_size_inches(6, 4)
plt_path = figures_path / 'measure_stripplot.png'
plt.savefig(plt_path.as_posix(), format='png')

Distribution of MEASUREMENT values by M.UNIT

We will look at the distribution of ‘CM’ in more detail below.

Now let us turn to the most common ‘M.UNIT’, ‘CM’. Using describe(), we know that there are very large values and zero values for ‘MEASUREMENT’ if ‘M.UNIT’ == ‘CM’, so we take a subset:

xdata_cm = xdata[(xdata['M.UNIT'] == 'CM') & (xdata['MEASUREMENT'] > 0.0)].copy()
xdata_cm['logmeasure'] = xdata_cm['MEASUREMENT'].apply(lambda mm: np.log10(mm))

We can use the seaborn package on this subset to give us a nice boxplot, which is a different view from what we saw with the stripplot above.

max_logmeasure = xdata_cm.logmeasure.max()
max_logmeasure_shipment_id = xdata_cm['logmeasure'].idxmax()

g = sns.boxplot(x=xdata_cm['logmeasure'])
# remove the top and right line in graph
sns.despine()
_ = plt.annotate(s=max_logmeasure_shipment_id,
                 xy=(max_logmeasure, 0),
                 xytext=(0.85, 0.65),        # bottom, left
                 textcoords='axes fraction',
                 # Shrink the arrow to avoid occlusion
                 arrowprops={'facecolor': 'gray', 'width': 1, 'shrink': 0.09, 'headlength': 9},
                 backgroundcolor='white')
g.figure.set_size_inches(6, 4)
# plt.show()
plt_path = figures_path / 'logmeasure_box.png'
plt.savefig(plt_path.as_posix(), format='png')

The annotation we added shows the row id of this extreme outlier.

We can also see from the plot and describe() that the interquartile range is 58 (from 7 to 65), there are a bunch of outliers from roughly 2000–50000, then a few above that.
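Those quartiles are easy to read off directly (a sketch):

xdata_cm['MEASUREMENT'].quantile([0.25, 0.75])   # roughly 7 and 65, an IQR of 58

With that in hand, let’s list the rows with the largest ‘logmeasure’ values: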

cols_of_interest = ['SHIPPER', 'WEIGHT (KG)', 'QUANTITY', 'MEASUREMENT', 'CONTAINER COUNT', 'PRODUCT DETAILS']
xdata_cm.sort_values(by='logmeasure', ascending=False)[cols_of_interest].head(10)

The largest values of log(MEASUREMENT)

The top entry, our extreme outlier, is very suspicious looking.

If ‘MEASUREMENT’ is in cubic meters, this would be a volume equivalent to about 10% of annual global oil shipments.

Even more unbelievable is the ‘QUANTITY’ field.

If correct, this shipment alone would supply a 40' [sic] HD LCD TV to one out of three people on Earth.

I just don’t believe it.

Moving down the table to ‘Meyer Industries’, we come up with a weight per unit of 13.2 Kg, which is plausible for a set of aluminum cookware. I still can’t make the ‘MEASUREMENT’ field work, however.

From a weight perspective, this could have been shipped in as few as six 40 ft containers, less than the 16 reported.

However, sixteen 40 ft containers only have a capacity of 1083 cubic meters, far less than the 29K m³ in the ‘MEASUREMENT’ field.
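The capacity arithmetic, spelled out (a sketch; 67.7 m³ is my approximation of the internal volume of a 40 ft dry container, presumably what the notebook’s CAPACITY_CONTAINER_40FT holds):

CAPACITY_CONTAINER_40FT = 67.7      # m³, approximate internal volume of a 40 ft dry container
16 * CAPACITY_CONTAINER_40FT        # ≈ 1083 m³, versus the reported MEASUREMENT of ~29,000 m³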

Is there any hope that our guess is correct, that an ‘M.UNIT’ of ‘CM’ means cubic meters? Let’s take a look at some values from the third quartile.

xdata_cm_sm = xdata_cm[xdata_cm['MEASUREMENT'] < CAPACITY_CONTAINER_40FT]
xdata_cm_sm.sort_values(by='logmeasure', ascending=False)[cols_of_interest].head(10)

Measurement should indicate that it would fit in a 40 ft container

By sheer coincidence, the top item in this list is also cookware, with a weight per carton of 13.2 Kg, the same as above. A quick check for an 18 piece cookware set on Amazon gives us one with product dimensions of 11.5 x 22.8 x 13 inches (3409 in³ or 0.05586 m³) and a shipping weight of 21.9 pounds. This is a bit lighter than the cookware in the shipment here, and if we scale up the volume by the ratio of the weight differences, we get 0.074225 m³. All 942 sets, perfectly packed, would then need a volume of about 69.92 m³, which is in the ballpark (4.3% over) of the ‘MEASUREMENT’ value of 67.
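Spelled out as a quick computation (a sketch of the arithmetic just described):

set_volume_m3 = 11.5 * 22.8 * 13 * 0.0254**3      # ≈ 0.0559 m³ per boxed set
weight_ratio = 13.2 / (21.9 * 0.453592)           # shipment carton weight vs. Amazon shipping weight
scaled_volume_m3 = set_volume_m3 * weight_ratio   # ≈ 0.0742 m³
total_volume_m3 = 942 * scaled_volume_m3          # ≈ 69.9 m³, vs. a MEASUREMENT of 67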

We could look at some more examples, but there is some evidence that ‘CM’ means cubic meters.

A more important take-away from this section is that this data set appears to need a lot of cleaning up before it could be used for predictions.

Fraud Check

In our exploration of the heaviest shipments, the items from the Bahamas surprised me: I think of blue water and sandy beaches when I think of the Bahamas, not oil exports.

The description for the heaviest Bahamian export is “1 BULK 277328.140 BARRELS ALKYLATE”.

Alkylate is a high-octane blending component for motor gasoline.

For the sake of argument, let’s suppose that the auditing team wanted to check up on this for possible fraud.

It might not seem like there is enough data to be useful, but let’s see if we can help the audit team change this from a yellow flag (questionable) to either a red flag (fraud) or a green flag (normal transaction).

We will leave fields like ‘SHIPPER’ and ‘CONSIGNEE ADDRESS’ to our crack team of private investigators while we look at something a bit more esoteric.

One important thing to point out: I am not a subject matter expert on petroleum shipments, so I am making a lot of assumptions here.

As a side note, I became interested in this entry when I was looking at weight per container; ‘CONTAINER COUNT’ is 1 but the weight is massive, so it messed up my nice histograms.

From ‘PRODUCT DETAILS’ we have the quantity: 277328.140 BARRELS.

The density of alkylate depends on temperature.

If we assume that density varies linearly in the 15–38 °C range, we can derive a formula for the temperature given weight and density.

Photo by Manki Kim on Unsplash

We have the number of barrels, and the fact that a barrel is 42 U.S. standard gallons, which after conversions gives us the number of liters. We get the weight from the ‘WEIGHT (KG)’ column, and dividing gives us the density. Plugging that into our linear formula, we find the temperature was 21.24 °C. (These calculations are spelled out in the Jupyter notebook.)
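The notebook’s calculation is not reproduced here, but the shape of it is simple (a sketch; the function below is mine, and the two density endpoints would come from an alkylate data sheet rather than from this data set):

BBL_TO_LITERS = 42 * 3.785411784    # one barrel = 42 U.S. gallons

def implied_temperature_c(weight_kg, barrels, rho_at_15c, rho_at_38c):
    """Back out the implied density (kg/L) from weight and volume, then invert a
    linear density-vs-temperature model between 15 °C and 38 °C."""
    rho = weight_kg / (barrels * BBL_TO_LITERS)
    slope = (rho_at_38c - rho_at_15c) / (38.0 - 15.0)
    return 15.0 + (rho - rho_at_15c) / slope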

The ‘ARRIVAL DATE’ was 01/31/2013.

A Handysize ship travels at about 14 knots; using this site, we see that the travel time is 3.2 days from Nassau, Bahamas to New York, and the weather on 01/27/2013 had a low of 21 °C and a high of 27 °C in Nassau.

So the weight and product details seem plausible.

To finalize this, one would have to look at the contract.

One agreement (not involving Walmart) that I found in SEC archives has this clause: “Barrel” — 42 U.S. standard gallons measured at 60 degrees Fahrenheit [15.556 °C].

If we suppose that this same temperature applied here, how far off would we be? Simple algebra shows us that they shipped 0.7% more than necessary if this were the case.

I think we can tell the auditors to mark this with a green flag.

Final Thoughts

If you imagine your data as a multi-dimensional spiky ball:

Wolfram Mathematica Version 12 Spikey

you can turn it this way and that and look at which spikes stick out the most (the outliers).

After that, if you hack off the spikes, you can look at the shape of your more typical data.

We have done that in this article by looking at count and weight outliers.

If you are publishing a data set, either internally or externally, always include a data dictionary even if the values seem obvious.

There are several good resources in the reference section below.

It is surprising what one can do with many datasets, even if they appear to have incomplete or limited data.

By using data from different columns, scanning some text fields, and using external sources, one can often tease out some interesting tidbits.

Thank you for joining me on this rambling journey through this interesting data set.

I hope that the techniques I outlined here will be helpful to you as you explore your own data sets.

References and Further Reading

A full set of links is given in the source repository.

The two Exploratory Data Analysis references both use the R language, but are worth looking at even if you just use Python.

How to Make a Data Dictionary, Open Science Framework
Data Dictionary: a how to and best practices, Carl Anderson
Data Dictionaries, U.S. Geological Survey
Exploratory Data Analysis with R, Roger D. Peng
Exploratory Data Analysis, Johns Hopkins University via Coursera
3 Awesome Visualization Techniques for every dataset, Rahul Agarwal
GeoJSON and choropleth, Felipe [ocefpaf], Frank Conengmo, Joshua Cano, FloChehab

