“A social network of a karate club was studied by Wayne W.
Zachary for a period of three years from 1970 to 1972.
The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club.
During the study, a conflict arose between the administrator “John A” and instructor “Mr.
Hi” (pseudonyms), which led to the split of the club into two.
Half of the members formed a new club around Mr.
Hi; members from the other part found a new instructor or gave up karate.
Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.
”Basic graph notionsA graph G=(V, E) is made of a set of :Nodes also called verticles, V=1,…,nEdges E⊆V×VAn edge (i,j) ∈ E links nodes i and j.
i and j are said to be neighbors.
A degree of a node is its number of neighborsIllustration of nodes, edges, and degreesA graph is complete is all nodes have n−1 neighbors, i.
e all nodes are connected in every possible way.
A path from i to j is a sequence of edges that goes from i to j.
This path has a length equal to the number of edges it goes through.
The diameter of a graph is the length of the longest path among all the shortest path that link any two nodes.
For example, in this case, we can compute some of the shortest paths to link any two nodes.
The diameter would typically be 3 since the is no pair of nodes such that the shortest way to link them is longer than 3.
Diameter of 3The shortest path between two nodes is called the geodesic path.
If all the nodes can be reached from each other by a given path, they form a connected componentA graph is connected is it has a single connected componentFor example, here is a graph with 2 different connected components :2 connected componentsA graph is directed if edges are ordered pairs.
In this case, the “in-degree” of i is the number of incoming edges to i, and the “out-degree” is the number of outgoing edges from i.
Directed GraphA graph is cyclic if there are paths through relationships and nodes where you walk from and back to a particular node.
A graph is weighted if we assign weights to either nodes or relationships.
A graph is sparse if the number of relationships is large compared to nodes.
To summarize :Summary (Neo4J Graph Book)Let’s now see how to retrieve this information from a graph in Python :n=34G_karate.
degree()The attribute .
degree() returns the list of the number of degrees (neighbors) for each node of the graph :DegreeView({0: 16, 1: 9, 2: 10, 3: 6, 4: 3, 5: 4, 6: 4, 7: 4, 8: 5, 9: 2, 10: 3, 11: 1, 12: 2, 13: 5, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 3, 20: 2, 21: 2, 22: 2, 23: 5, 24: 3, 25: 3, 26: 2, 27: 4, 28: 3, 29: 4, 30: 4, 31: 6, 32: 12, 33: 17})Then, isolate the values of the degrees :# Isolate the sequence of degreesdegree_sequence = list(G_karate.
degree())Compute the number of edges, but also metrics on the degree sequence :nb_nodes = nnb_arr = len(G_karate.
edges())avg_degree = np.
mean(np.
array(degree_sequence)[:,1])med_degree = np.
median(np.
array(degree_sequence)[:,1])max_degree = max(np.
array(degree_sequence)[:,1])min_degree = np.
min(np.
array(degree_sequence)[:,1])Finally, print all this information :print("Number of nodes : " + str(nb_nodes))print("Number of edges : " + str(nb_arr))print("Maximum degree : " + str(max_degree))print("Minimum degree : " + str(min_degree))print("Average degree : " + str(avg_degree))print("Median degree : " + str(med_degree))This heads :Number of nodes : 34Number of edges : 78Maximum degree : 17Minimum degree : 1Average degree : 4.
588235294117647Median degree : 3.
0On average, each person in the graph is connected to 4.
6 persons.
We can also plot the histogram of the degrees :degree_freq = np.
array(nx.
degree_histogram(G_karate)).
astype('float')plt.
figure(figsize=(12, 8))plt.
stem(degree_freq)plt.
ylabel("Frequence")plt.
xlabel("Degre")plt.
show()Degree HistogramWe will, later on, see that the histograms of degrees are quite important to determine the kind of graph we are looking at.
II.
How is a graph stored?You might now wonder how we can store complex graph structures?There are 3 ways to store graphs, depending on the usage we want to make of it :In an edge list :1 21 31 42 33 4.
We store the ID of each pair of nodes linked by an edge.
Using the adjacency matrix, usually loaded in memory :Adjacency matrixFor each possible pair in the graph, set it to 1 if the 2 nodes are linked by an edge.
A is symmetric if the graph is undirected.
Using adjacency lists :1 : [2,3, 4]2 : [1,3]3: [2, 4].
The best representation will depend on the usage and available memory.
Graphs can usually be stored as .
txt files.
Some extensions of graphs might include :weighted edgeslabel on nodes/edgesfeature vectors associated with nodes/edgesIII.
Types of GraphsIn this section, we’ll cover the two major types of graphs :Erdos-RényiBarabasi-Albert1.
Erdos-Rényi modela.
DefinitionIn an Erdos-Rényi model, we build a random graph model with n nodes.
The graph is generated by drawing an edge between a pair of nodes (i,j) independently with probability p.
We therefore have 2 parameters : the number of nodes : n and the probability : p.
Erdos-Rényi GraphsIn Python, the networkx package has a built-in function to generate Erdos-Rényi graphs.
# Generate the graphn = 50p = 0.
2G_erdos = nx.
erdos_renyi_graph(n,p, seed =100)# Plot the graphplt.
figure(figsize=(12,8))nx.
draw(G_erdos, node_size=10)You’ll get a result pretty similar to this one :Generated Graphb.
Degree distributionLet pk the probability that a randomly selected node has a degree k.
Due to the random way the graphs are built, the distribution of the degrees of the graph is binomial :Binomial node degree distributionThe distribution of the number of degrees per node should be really close to the mean.
The probability to observe a high number of nodes decreases exponentially.
degree_freq = np.
array(nx.
degree_histogram(G_erdos)).
astype('float')plt.
figure(figsize=(12, 8))plt.
stem(degree_freq)plt.
ylabel("Frequence")plt.
xlabel("Degree")plt.
show()To visualize the distribution, I have increased n to 200 in the generated graph.
Degree Distributionc.
Descriptive statisticsThe average degree is given by n×p.
With p=0.
2 and n=200, we are centered around 40.
The degree expectation is given by (n−1)×pThe maximum degree is concentrated around the averageLet’s retrieve those values in Python :# Get the list of the degreesdegree_sequence_erdos = list(G_erdos.
degree())nb_nodes = nnb_arr = len(G_erdos.
edges())avg_degree = np.
mean(np.
array(degree_sequence_erdos)[:,1])med_degree = np.
median(np.
array(degree_sequence_erdos)[:,1])max_degree = max(np.
array(degree_sequence_erdos)[:,1])min_degree = np.
min(np.
array(degree_sequence_erdos)[:,1])esp_degree = (n-1)*pprint("Number of nodes : " + str(nb_nodes))print("Number of edges : " + str(nb_arr))print("Maximum degree : " + str(max_degree))print("Minimum degree : " + str(min_degree))print("Average degree : " + str(avg_degree))print("Expected degree : " + str(esp_degree))print("Median degree : " + str(med_degree))This should give you something similar to :Number of nodes : 200Number of edges : 3949Maximum degree : 56Minimum degree : 25Average degree : 39.
49Expected degree : 39.
800000000000004Median degree : 39.
5The average and the expected degrees are really close since there is only a small factor between the two.
2.
Barabasi-Albert modela.
DefinitionIn a Barabasi-Albert model, we build a random graph model with n nodes.
The graph is generated by the following algorithm :Step 1: With a probability p, move to the second step.
Else, move to the third step.
Step 2: Connect a new node to existing nodes chosen uniformly at randomStep 3: Connect the new node to n existing nodes with a probability proportional to their degreeThe aim of such graph is to model preferential attachment, which is often observed in real networks.
In Python, the networkx package has also a built-in function to generate Barabasi-Albert graphs.
# Generate the graphn = 150m = 3G_barabasi = nx.
barabasi_albert_graph(n,m)# Plot the graphplt.
figure(figsize=(12,8))nx.
draw(G_barabasi, node_size=10)You’ll get a result pretty similar to this one :Barabasi-Albert GraphYou can easily notice how some nodes appear to have a much larger degree than others now!b.
Degree DistributionLet pk the probability that a randomly selected node has a degree k.
The degree distribution follows a power-law :Power-law degree distributionThe distribution is now heavy-tailed.
There is a large number of nodes that have a small degree, but a significant number of nodes have a high degree.
degree_freq = np.
array(nx.
degree_histogram(G_barabasi)).
astype('float')plt.
figure(figsize=(12, 8))plt.
stem(degree_freq)plt.
ylabel("Frequence")plt.
xlabel("Degree")plt.
show()Degree DistributionThe distribution is said to be scale-free, in the sense that the average degree is not informative.
c.
Descriptive statisticsThe average degree is constant if α≤2, else, it divergesThe maximum degree is :# Get the list of the degreesdegree_sequence_erdos = list(G_erdos.
degree())nb_nodes = nnb_arr = len(G_erdos.
edges())avg_degree = np.
mean(np.
array(degree_sequence_erdos)[:,1])med_degree = np.
median(np.
array(degree_sequence_erdos)[:,1])max_degree = max(np.
array(degree_sequence_erdos)[:,1])min_degree = np.
min(np.
array(degree_sequence_erdos)[:,1])esp_degree = (n-1)*pprint("Number of nodes : " + str(nb_nodes))print("Number of edges : " + str(nb_arr))print("Maximum degree : " + str(max_degree))print("Minimum degree : " + str(min_degree))print("Average degree : " + str(avg_degree))print("Expected degree : " + str(esp_degree))print("Median degree : " + str(med_degree))This should give you something similar to :Number of nodes : 200Number of edges : 3949Maximum degree : 56Minimum degree : 25Average degree : 39.
49Expected degree : 39.
800000000000004Median degree : 39.
5IV.
ConclusionSo far, we covered the main kind of graphs, and the most basic characteristics to describe a graph.
In the next article, we’ll dive into graph analysis/algorithms and the different ways a graph can be analyzed, used for :Real-time fraud detectionReal-time recommendationsStreamline regulatory complianceManagement and monitoring of complex networksIdentity and access managementSocial applications/features…Feel free to comment if you have any question or remark.
Stay tuned, the next article is coming out next week :)Sources :A Comprehensive Guide to Graph Algorithms in Neo4j, Mark Needham & Amy E.
HodlerNetworkx documentation, https://networkx.
github.
io/documentation/stable/.