# Introduction to PyTorch BigGraph — with Examples

By Sven Balnojan, Jun 21

(Header image: network photo by Alina Grubnyak on Unsplash.)

PyTorch BigGraph is a tool to create and handle large graph embeddings for machine learning.

Currently there are two approaches in graph-based neural networks:

1. Directly use the graph structure and feed it to a neural network. The graph structure is then preserved at every layer. Graph CNNs use that approach; see for instance my post or this paper on that.
2. Create a large embedding of the graph and then use it as features in a traditional neural network. This is the reasonable route when the graph is too large for the first approach, which is the case for most graphs.

PyTorch BigGraph handles the second approach, and we will do so as well below.

Just for reference let’s talk about the size aspect for a second.

Graphs are usually encoded by their adjacency matrix.

If you have a graph with 3,000 nodes and an edge between every pair of nodes, the adjacency matrix has 3,000 × 3,000 = 9,000,000 entries, so roughly 10 million.

Even if that’s stored sparsely, this apparently exceeds the memory of most GPUs according to the paper linked above.

If you think about the usual graphs used in recommendation systems, you’ll realise they are typically much larger than that.
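To make the size argument concrete, here is a small back-of-the-envelope sketch. The 4 bytes per entry (float32) and the 1,000,000-node graph are my own assumptions for illustration, not figures from the paper. The point is the quadratic growth of a dense adjacency matrix.

```python
def dense_adjacency_entries(num_nodes: int) -> int:
    """Number of entries in a dense num_nodes x num_nodes adjacency matrix."""
    return num_nodes * num_nodes


def dense_adjacency_bytes(num_nodes: int, bytes_per_entry: int = 4) -> int:
    """Memory for the dense matrix, assuming 4-byte (float32) entries."""
    return dense_adjacency_entries(num_nodes) * bytes_per_entry


# 3,000 nodes is the example from the text; 1,000,000 is an assumed larger graph.
for n in (3_000, 1_000_000):
    entries = dense_adjacency_entries(n)
    gigabytes = dense_adjacency_bytes(n) / 1e9
    print(f"{n:,} nodes: {entries:,} entries, about {gigabytes:,.3f} GB")
```

Going from thousands to millions of nodes multiplies the number of entries by roughly a hundred thousand, which is why a file-based embedding approach becomes attractive.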

Now there are already some excellent posts about the how and why of BigGraph, so I won’t spend more time on that.

I’m interested in applying BigGraph to my machine learning problem, and for that I like to take the simplest examples and get things to work.

I constructed two examples which we will walk through step by step.

The whole code is refactored and available on GitHub.

It’s adapted from the example found at the BigGraph repository.

The first example is part of the LiveJournal graph and the data looks like this:

    # FromNodeId ToNodeId
    0 1
    0 2
    0 3
    ...
    0 10
    0 11
    0 12
    ...
    0 46
    1 0
    ...

The second example is simply 8 nodes with edges:

    # FromNodeId ToNodeId
    0 1
    0 2
    0 3
    0 4
    1 0
    1 2
    1 3
    1 4
    2 1
    2 3
    2 4
    3 1
    3 2
    3 4
    3 7
    4 1
    5 1
    6 2
    7 3

## Embedding a Part of LiveJournal’s Graph

BigGraph is made to work around the memory limit of machines, so it’s completely file based.

You’ll have to trigger processes to create the appropriate file structure.

And if you want to run an example again, you’ll have to delete the checkpoints.

We also have to split the data into train and test sets beforehand, again on a file basis.

The file format is TSV, tab separated values.
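As a minimal sketch of what such an input file looks like, the snippet below writes the 8-node example as a tab-separated edge list. The file name `edges.tsv` and the helper `write_edge_list` are hypothetical names chosen for illustration; they do not come from the BigGraph repository.

```python
import os
import tempfile

# The 8-node toy graph from the second example, as (from_node, to_node) pairs.
EDGES = [
    (0, 1), (0, 2), (0, 3), (0, 4),
    (1, 0), (1, 2), (1, 3), (1, 4),
    (2, 1), (2, 3), (2, 4),
    (3, 1), (3, 2), (3, 4), (3, 7),
    (4, 1), (5, 1), (6, 2), (7, 3),
]


def write_edge_list(path, edges):
    """Write one edge per line as 'lhs<TAB>rhs', every line newline-terminated."""
    with open(path, "wt") as f:
        f.write("# FromNodeId\tToNodeId\n")  # comment header, as in the data shown above
        for lhs, rhs in edges:
            f.write(f"{lhs}\t{rhs}\n")


# Hypothetical output location, used only for this illustration.
out_path = os.path.join(tempfile.mkdtemp(), "edges.tsv")
write_edge_list(out_path, EDGES)

with open(out_path, "rt") as f:
    written_lines = f.read().splitlines()
print(written_lines[:3])
```

Writing the trailing newline explicitly for every edge avoids lines being glued together later in the pipeline.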

Let’s dive right into it.

The first code snippet (helper functions and the random_split_file call) declares two helper functions, taken from the BigGraph source, sets some constants and runs the file split.

This splits the edges into a test and a train set by creating the two files data/example_1/test.txt and data/example_1/train.txt.
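The split itself can be sketched in plain Python. This is not the helper from the BigGraph source, only an illustration of the idea; the parameters `train_fraction`, `num_comment_lines` and `seed`, as well as the default of 0.75, are assumptions of mine.

```python
import os
import random
import tempfile


def random_split_file(fpath, train_fraction=0.75, num_comment_lines=0, seed=None):
    """Shuffle the edge lines of fpath and write train.txt / test.txt next to it.

    Returns the tuple (train_path, test_path).
    """
    root = os.path.dirname(fpath)
    train_path = os.path.join(root, "train.txt")
    test_path = os.path.join(root, "test.txt")

    with open(fpath, "rt") as f:
        lines = f.read().splitlines()

    # Drop the leading comment lines and any blank lines.
    lines = [line for line in lines[num_comment_lines:] if line.strip()]

    random.Random(seed).shuffle(lines)
    split = int(len(lines) * train_fraction)

    for path, chunk in ((train_path, lines[:split]), (test_path, lines[split:])):
        with open(path, "wt") as f:
            # Terminate every line explicitly so no edge is glued to the next one.
            f.writelines(line + "\n" for line in chunk)

    return train_path, test_path


# Tiny demonstration on a throwaway file with one comment line and 8 edges.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "edges.txt")
with open(demo_file, "wt") as f:
    f.write("# FromNodeId\tToNodeId\n")
    f.writelines(f"0\t{i}\n" for i in range(1, 9))

train_path, test_path = random_split_file(demo_file, num_comment_lines=1, seed=0)
with open(train_path, "rt") as f:
    train_lines = f.read().splitlines()
with open(test_path, "rt") as f:
    test_lines = f.read().splitlines()
print(len(train_lines), "train edges,", len(test_lines), "test edges")
```

Which edges end up in which file depends on the shuffle; only the sizes of the two sets are fixed by `train_fraction`.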

Next we use BigGraph’s converters to create the file-based structure for our datasets.

We will “partition” into 1 partition.

For that we already need parts of the config file.

Here’s the relevant part of the config file, the I/O data part and the graph structure.

    entities_base = 'data/example_1'


    def get_torchbiggraph_config():

        config = dict(
            # I/O data
            entity_path=entities_base,
            edge_paths=[],
            checkpoint_path='model/example_1',

            # Graph structure
            entities={
                'user_id': {'num_partitions': 1},
            },
            relations=[{
                'name': 'follow',
                'lhs': 'user_id',
                'rhs': 'user_id',
                'operator': 'none',
            }],
            # ... the scoring model and training settings follow in config_1.py

This tells BigGraph where to find our data and how to interpret our tab separated values.

With this config we can run the next Python snippet, which converts the data to _partitioned data.

The result should be a bunch of new files in the data dir, namely:

- two folders, test_partitioned and train_partitioned
- one file per folder for the edges, in h5 format for quick partial reads
- the dictionary.json file containing the mapping between “user_ids” and newly assigned ids
- entity_count_user_id_0.txt, which contains the entity count, in this case 47

The dictionary.json is important to later map results of the BigGraph model to the actual embedding we want to have.
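To illustrate why such a dictionary is needed, here is a conceptual sketch of an id remapping step. It is not BigGraph's converter; the function name `build_entity_dictionary` and the choice to number entities in order of first appearance are assumptions made for this example.

```python
def build_entity_dictionary(edges):
    """Assign consecutive integer ids to entity names in order of first appearance.

    edges is an iterable of (lhs_name, rhs_name) string pairs.
    Returns (names, index), where names[i] is the original name for new id i
    and index[name] is the new id of that name.
    """
    names = []
    index = {}
    for lhs, rhs in edges:
        for name in (lhs, rhs):
            if name not in index:
                index[name] = len(names)
                names.append(name)
    return names, index


# A few edges from the toy graph, with user ids kept as strings.
edges = [("0", "1"), ("0", "2"), ("5", "1"), ("7", "3")]
names, index = build_entity_dictionary(edges)

entity_count = len(names)  # the kind of number an entity count file would hold
remapped = [(index[lhs], index[rhs]) for lhs, rhs in edges]
print(names, entity_count, remapped)
```

After remapping, row i of an embedding matrix belongs to `names[i]`, which is exactly the lookup that a mapping file makes possible.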

Enough preparation, let’s train the embedding.

Take a look at config_1.py; it contains three relevant sections.

            # Scoring model - the embedding size
            dimension=1024,
            global_emb=False,

            # Training - the epochs to train and the learning rate
            num_epochs=10,
            lr=0.001,

            # Misc - not important
            hogwild_delay=2,
        )

        return config

To train the embedding we run the training snippet from the sample code.

We can evaluate the embedding on our test set, based on some built-in metrics, with the evaluation snippet from the sample code.

Now let’s try to retrieve the actual embedding.

Again, as everything is file based, it should now be located as an h5 file in the model/ folder.

We can load and output the embedding of user 0 by looking up their mapping in the dictionary.
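The lookup logic can be sketched without any file access. The dictionary layout used here, `{"entities": {"user_id": [...]}}` with names listed in the order of their new ids, is an assumption on my part, and the toy values stand in for what would be read from dictionary.json and the h5 file.

```python
def lookup_embedding(dictionary, embeddings, entity_type, entity_name):
    """Return the embedding row for entity_name.

    dictionary: assumed layout {"entities": {entity_type: [name_0, name_1, ...]}}
    embeddings: a sequence of rows, where row i belongs to the entity with new id i
    """
    names = dictionary["entities"][entity_type]
    offset = names.index(entity_name)  # new id assigned during conversion
    return embeddings[offset]


# Stand-in data for illustration only.
toy_dictionary = {"entities": {"user_id": ["3", "0", "7"]}}
toy_embeddings = [
    [0.1, 0.2],  # row 0 belongs to user "3"
    [0.3, 0.4],  # row 1 belongs to user "0"
    [0.5, 0.6],  # row 2 belongs to user "7"
]

user_0_embedding = lookup_embedding(toy_dictionary, toy_embeddings, "user_id", "0")
print(user_0_embedding)
```

Note that user "0" does not have to sit in row 0; the row is whatever position the dictionary assigns.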

Now let’s switch to our second example, a constructed one on which we hopefully can do something partially useful.

The LiveJournal data is simply too huge to run through in a reasonable amount of time.

## Link Prediction and Ranking on a Constructed Example

Alright, we will repeat the steps for the second example, except we will produce an embedding of dimension 10, so we can view it and work with it.

Besides, dimension 10 seems to me more than enough for 8 vertices.

We set those things up in config_2.py.

    entities_base = 'data/example_2'


    def get_torchbiggraph_config():

        config = dict(
            # I/O data
            entity_path=entities_base,
            edge_paths=[],
            checkpoint_path='model/example_2',

            # Graph structure
            entities={
                'user_id': {'num_partitions': 1},
            },
            relations=[{
                'name': 'follow',
                'lhs': 'user_id',
                'rhs': 'user_id',
                'operator': 'none',
            }],

            # Scoring model
            dimension=10,
            global_emb=False,

            # Training
            num_epochs=10,
            lr=0.001,

            # Misc
            hogwild_delay=2,
        )

        return config

Then we run the same code as before but in one go, taking care of the different file paths and format.

In this case we only have 3 lines of comments on top of the data file.

As final output you should get a bunch of things and in particular all embeddings.

Let’s do some basic tasks with the embedding.

Of course we could now load it into any framework we like, such as Keras or TensorFlow, but BigGraph already brings some implementations for common tasks like link prediction and ranking.

So let’s try them out.

We predict the scores for the entity pairs 0-7 (score_1) and 0-1 (score_2), since we know from our data that the link 0-1 should be much more likely.

As comparator we loaded the “DotComparator” which computes the dot product or scalar product of the two 10-dimensional vectors.

Turns out the outputted numbers are tiny, but at least score_2 is much higher than score_1 as we expected.
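What a dot product comparison computes can be shown in a few lines of plain Python. This is a sketch of the scalar product only, not BigGraph's comparator class, and the 3-dimensional vectors are made up for illustration.

```python
def dot_score(lhs_embedding, rhs_embedding):
    """Scalar product of two equally sized embedding vectors."""
    if len(lhs_embedding) != len(rhs_embedding):
        raise ValueError("embeddings must have the same dimension")
    return sum(a * b for a, b in zip(lhs_embedding, rhs_embedding))


# Made-up vectors, for illustration only.
emb_a = [1.0, 2.0, 0.0]
emb_b = [1.0, 1.0, 0.0]  # points in a similar direction to emb_a
emb_c = [0.0, 0.0, 1.0]  # orthogonal to emb_a

score_similar = dot_score(emb_a, emb_b)
score_unrelated = dot_score(emb_a, emb_c)
print(score_similar, score_unrelated)
```

Only the relative order of the scores matters for link prediction, which is why tiny absolute numbers are not a problem in themselves.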

Finally as the last piece of code we can produce a ranking of similar items, which uses the same mechanism as before.

We use the scalar product to score the embedding of one entity against the embeddings of all other entities and then rank them.

The top entities in this case are, in order, 0, 1, 3, 7, ... and if you look at the data that seems to be pretty much right.
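The ranking step can be sketched the same way: score one query embedding against every row and sort by score. The function name `rank_by_dot_product` and the 2-dimensional toy embeddings are assumptions for illustration, not the trained values from this example.

```python
def rank_by_dot_product(query, embeddings):
    """Return entity indices sorted from highest to lowest dot product with query."""
    scores = [sum(q * e for q, e in zip(query, row)) for row in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)


# Made-up embeddings, for illustration only.
toy_embeddings = [
    [2.0, 0.0],   # entity 0
    [-1.0, 0.0],  # entity 1
    [1.0, 1.0],   # entity 2
    [0.0, 1.0],   # entity 3
]

# Rank all entities by their similarity to entity 0.
ranking = rank_by_dot_product(toy_embeddings[0], toy_embeddings)
print(ranking)
```

In this toy data the query entity ranks first, just as entity 0 leads the ranking reported above.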

## More Fun

These are the most basic examples I could come up with.

I did not run the original examples on the Freebase data or on the full LiveJournal data, simply because they take quite some time to train.

You can find the code and references here:

- GitHub repository of PyTorch BigGraph
- GitHub repository with sample code
- A. Lerer et al. (2019), PyTorch-BigGraph: A Large-scale Graph Embedding System. https://arxiv.org/pdf/1903.12287.pdf
- T. N. Kipf, M. Welling (2016), Semi-Supervised Classification with Graph Convolutional Networks. https://arxiv.org/abs/1609.02907

## Problems You Might Encounter

I ran the code on my Mac and encountered three issues:

1. An error stating “lib*…. Reason: image not found”. The solution is to install what’s missing, e.g. with “brew install libomp”.
2. I then ran into an error “AttributeError: module ‘torch’ has no attribute ‘_six’”, which might simply be because of incompatible Python and torch versions. Anyway, I moved from Python 3.6 & torch 1.1 to Python 3.7 & torch 1.X and had my problem solved.
3. Inspect train.txt and test.txt before you move on; I saw some missing newlines there while testing.
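For the last issue, a quick check like the following can flag suspicious lines. It is a sketch under the assumption that every edge line should contain exactly two tab-separated fields; the helper name `find_malformed_lines` is hypothetical.

```python
def find_malformed_lines(lines, expected_fields=2):
    """Return (line_number, text) for each non-empty line without exactly expected_fields fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text:
            continue
        if len(text.split("\t")) != expected_fields:
            problems.append((number, text))
    return problems


# The second line simulates two edges glued together by a missing newline.
sample = ["0\t1\n", "0\t21\t0\n", "1\t2\n"]
malformed = find_malformed_lines(sample)
print(malformed)
```

To check a real file, pass the open file object instead of the list, for example `find_malformed_lines(open("data/example_1/train.txt"))`.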

Hope this helps and is fun to play with!