Visualising Machine Learning Datasets with Google’s FACETS.

An open-source tool from Google to easily learn patterns from large amounts of data

Parul Pandey, Jan 28

"More data beats clever algorithms, but better data beats more data": Peter Norvig

There has been a lot of discussion about how a large quantity of training data can have a tremendous impact on the results of a machine learning model.

However, along with data quantity, data quality is also critical to building a powerful and robust ML system.

After all, ‘garbage in, garbage out’: what you get from the system will be a representation of what you feed into it.

A Machine Learning dataset sometimes consists of data points ranging from thousands to millions which in turn may contain hundreds or thousands of features.

Additionally, real-world data is messy, with missing values, imbalanced data, outliers and so on.

Therefore it becomes imperative that we clean the data before proceeding with model building.

Visualising the data can help in locating these irregularities and pointing out the locations where the data actually needs cleaning.

Data visualisation gives an overview of the entire data irrespective of its quantity and helps to perform exploratory data analysis (EDA) quickly and accurately.

FACETS

The dictionary meaning of facet boils down to a particular aspect or feature of something.

In the same way, the FACETS tool helps to understand the various features of data and explore them without having to explicitly code.

Facets is an open-source visualisation tool released by Google under the PAIR (People + AI Research) initiative.

This tool helps us to understand and analyse machine learning datasets.

Facets consists of two visualisations, both of which help to drill down into the data and provide great insights without much work on the user’s end.

Facets Overview

As the name suggests, this visualisation gives an overview of the entire dataset and a sense of the shape of each feature of the data.

Facets Overview summarizes statistics for each feature and compares the training and test datasets.

Facets Dive

This visualisation helps the user dive deep into individual features and observations of the data to get more information.

It helps in interactively exploring large numbers of data points at once.

These visualizations are implemented as Polymer web components, backed by TypeScript code, and can be easily embedded into Jupyter notebooks or web pages.

Usage & Installation

There are two ways in which FACETS can be used with data:

Web App

It can be used directly from its demo page: Facets – Visualizations for ML datasets (pair-code.github.io).

This website allows anyone to visualize their own datasets directly in the browser, without any software installation or setup and without the data ever leaving their computer.

Within Jupyter Notebooks/Colaboratory

It is also possible to use FACETS within a Jupyter Notebook or Colaboratory.

This gives more flexibility, since the entire EDA and modelling can be done in a single notebook.

Please refer to their GitHub repository for complete details on installation.

Later in the article, however, we will see how to get going with FACETS in Colab.

Data

Although you can work with the data provided on the demo page, I shall be working with another set of data.

I will be doing EDA with FACETS on the Loan Prediction dataset.

The problem statement is to predict whether an applicant who has been granted a loan by a company will repay it or not.

It is a fairly well-known example in the ML community.

The dataset, which has already been divided into training and testing sets, can be accessed from here.

Let’s load our data into Colab.

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Now let us understand how we can use Facets Overview with this data.

FACETS Overview

The Overview automatically gives a quick understanding of the distribution of values across the various features of the data.

The distribution can also be compared across the training and testing datasets instantly.

If an anomaly exists in the data, it stands out immediately.

Some of the information that can be easily accessed through this feature:

- Statistics like mean, median and standard deviation
- Min and max values of a column
- Missing data
- Values that are zero

Since the distributions of the test dataset can be viewed as well, we can easily confirm whether the training and testing data follow the same distribution.

One could argue that these tasks are easily achieved with Pandas, so why invest in another tool?

That is true, and another tool may not be required when we have few data points with a minimal number of features.

However, the scenario changes when we are talking about a large dataset, where it becomes difficult to analyse each and every data point across multiple columns.
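For comparison, the Pandas route mentioned above might look like the following minimal sketch. The toy DataFrame and the helper name `summarize` are hypothetical and only stand in for a real training set; the calls themselves are standard Pandas.

```python
import pandas as pd

# Hypothetical toy data standing in for a real training set.
df = pd.DataFrame({
    "income": [5000, 0, 3200, None, 4100],
    "loan_amount": [120.0, 80.0, None, 0.0, 150.0],
})


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Collect per-column statistics of the kind listed above."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "missing": numeric.isna().sum(),   # count of missing values
        "zeros": (numeric == 0).sum(),     # count of values equal to zero
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        "min": numeric.min(),
        "max": numeric.max(),
    })


summary = summarize(df)
print(summary)
```

This works well for a handful of columns; with hundreds of features, the resulting table becomes hard to scan, which is where an interactive view helps.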

Google Colaboratory makes this very easy, since we do not need to install anything additional.

A few lines of code get the work done.

# Clone the facets github repo to get access to the python feature stats generation code
!git clone https://github.com/pair-code/facets.git

To calculate the feature statistics, we need to use the class GenericFeatureStatisticsGenerator, which lives in a Python script in the repository.

# Add the path to the feature stats generation code.
import sys
sys.path.insert(0, '/content/facets/facets_overview/python/')

# Create the feature stats for the datasets and stringify it.
import base64
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': train},
                                  {'name': 'test', 'table': test}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

Now with the following lines of code, we can easily display the visualisation right in our notebook.

# Display the facets overview visualization for this data
from IPython.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

As soon as you press Shift+Enter, you are welcomed by this nice interactive visualisation:

Here, we see the Facets Overview visualization of the five numeric features of the Loan Prediction dataset.

The features are sorted by non-uniformity, with the feature with the most non-uniform distribution at the top.

Numbers in red indicate possible trouble spots, in this case, numeric features with a high percentage of values set to 0.

The histograms at right allow you to compare the distributions between the training data (blue) and test data (orange).

The above visualisation shows one of the eight categorical features of the dataset.

The features are sorted by distribution distance, with the feature with the biggest skew between the training (blue) and test (orange) datasets at the top.
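To make the idea of a distribution distance between the training and test sets concrete, here is a minimal sketch for a single categorical feature. It uses the largest absolute gap between category proportions (the L-infinity distance) as one possible measure; this is an assumption for illustration and not necessarily the measure Facets computes. The function name and the toy values are hypothetical.

```python
import pandas as pd


def category_distance(train_col: pd.Series, test_col: pd.Series) -> float:
    """Largest absolute gap between category proportions in two splits."""
    p = train_col.value_counts(normalize=True)
    q = test_col.value_counts(normalize=True)
    # Align on the union of categories; a category missing from one split counts as 0.
    p, q = p.align(q, fill_value=0.0)
    return float((p - q).abs().max())


# Hypothetical values for a property-area style feature.
train_area = pd.Series(["Urban", "Urban", "Rural", "Semiurban"])
test_area = pd.Series(["Urban", "Rural", "Rural", "Rural"])

distance = category_distance(train_area, test_area)
print(distance)
```

A value near 0 suggests the two splits have similar category proportions for that feature, while a larger value points to a skew worth inspecting.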

FACETS Dive

Facets Dive provides an easy-to-customize, intuitive interface for exploring the relationship between the data points across the different features of a dataset.

With Facets Dive, you control the position, colour and visual representation of each data point based on its feature values.

If the data points have images associated with them, the images can be used as the visual representations.

To use the Dive visualisation, the data has to be transformed into JSON format.
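As a small illustration of what that JSON looks like, the sketch below converts a toy DataFrame with `to_json(orient='records')`, the same call used in the code that follows, and parses the result back. The column names and values are hypothetical.

```python
import json

import pandas as pd

# Hypothetical two-row frame; real data would come from train.csv.
df = pd.DataFrame({"Gender": ["Male", "Female"], "LoanAmount": [128.0, 66.0]})

# orient='records' yields a JSON array with one object per row.
jsonstr = df.to_json(orient="records")
records = json.loads(jsonstr)
print(records)
```

Each row becomes one object keyed by column name, which is the shape assigned to the Dive element's data property.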

# Display the Dive visualization for the training data.
from IPython.display import display, HTML

jsonstr = train.to_json(orient='records')
HTML_TEMPLATE = """<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/master/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

After you run the code, you should be able to see this:

Facets Dive Visualisation

Now we can easily perform univariate and bivariate analysis. Let us see some of the results obtained.

Univariate Analysis

Here we will look at the target variable, i.e., Loan_Status, and other categorical features like gender, marital status, employment status and credit history, independently.

Likewise, you can play around with other features also.

Inferences

Most of the applicants in the dataset are male.

A majority of the applicants are also married and have repaid their debts.

Also, most of the applicants have no dependents and are graduates from semi-urban areas.

Now let’s visualize the ordinal variables, i.e., Dependents, Education and Property Area.

The following inferences can be made from the above bar plots:

- Most of the applicants don’t have any dependents.
- Most of the applicants are graduates.
- Most of the applicants are from a semi-urban area.
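Statements such as "most of the applicants are graduates" can be cross-checked numerically. The following is a minimal sketch using `value_counts(normalize=True)` on a toy column; the values are made up for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical values for an education-style feature.
toy = pd.DataFrame(
    {"Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"]}
)

# Share of each category, the numeric counterpart of a univariate bar plot.
shares = toy["Education"].value_counts(normalize=True)
print(shares)
```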

Now you can continue your analysis with the numerical data.

Bivariate Analysis

We will now look at the relationship between the target variable and the categorical independent variables.

It can be inferred from the above bar plots that:

- The proportion of married applicants is higher for the approved loans.
- The distribution of applicants with 1 or 3+ dependents is similar across both categories of Loan_Status.
- Applicants with a credit history of 1 seem more likely to get their loans approved.
- The proportion of loans approved in the semi-urban area is higher than in rural or urban areas.
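The same kind of bivariate comparison can be sketched in Pandas with `pd.crosstab`, normalising each row so that it shows the share of each Loan_Status value within a category. The toy values and the exact column name `Credit_History` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; the real values would come from the training set.
toy = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0, 1],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

# normalize="index" turns counts into per-row proportions.
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)
```

Reading across a row gives the approval share for that credit history value, which is what the stacked bars in Dive convey visually.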

Conclusion

FACETS provides an easy and intuitive environment to perform EDA on datasets and helps us derive meaningful results.

The only catch is that currently it only works with Chrome.

Before ending this article, here is a fun fact highlighting how a small human labelling error in the CIFAR-10 dataset was caught using Facets Dive.

While analysing the dataset, it was noticed that an image of a frog had been incorrectly labelled as a cat.

This is quite an achievement, since spotting such an error by eye in a dataset of that size would be nearly impossible.
