But let us take it at face value, and create metrics from them.

These metrics will ultimately derive features for us.

Metric 1: Deductive Metric (DED)It follows deductive reasoning and is supported by the generalisations of a node: if my four grandparents are mortal, I will probably be mortal too.

The formula for calculating DED metric is:The metric gives us the likelihood of edge x → y.

This process is a top-down one, for estimating edge likelihood.

graphic representation of the top-down deductive process for estimating edge likelihood according to DED.

Metric 2: Inductive Metric (IND)It follows inductive reasoning and is supported by the specialisations of a node: If most of a author’s publication are sophisticated, then it is very likely that the author himself is a sophisticated person.

The formula for computing IND metric is:The metric gives us the likelihood of edge x → y.

This process is a bottom-top one, for estimating edge likelihood.

graphic representation of the bottom-up inductive process for estimating edge likelihood according to IND.

According to the two metrics defined above (DED & IND), likelihood of an edge is obtained from the proportion of satisfying evidence (how many of all my generalisations/specialisations have that edge).

However, a proportion may not be the best measure of significance in certain domains.

The absolute amount of evidence can be a decisive factor of reliability.

Consider the following example:The orange directed line is the edge which is to be classified as label 0/1.

In the following fig, A(x)₁ is the only ancestor of node x, and is also connected to node y.

This gives us the DED Metric equal to 1.

Fig A: A(x) = { A(x)₁}Here the set A(x) consists of 100 nodes, out of which 99 (excluding A(x)₄₇) are connected to node x.

This would give a DED metric equal to 0.

99 .

Fig A: A(x) = { A(z)₁, A(z)₂, A(z)₃,….

A(z)₁₀₀}In many domains edge z→ y seems more reliable than edge x → y because of the absolute amount of evidence (100 vs 1), even though it has less proportional evidence (99% vs 100%).

Given a modification of DED and IND in which the amount of satisfying ancestors and descendants, not only their proportion, is taken into account.

Metric 3: DED_LOGDED_LOG : Modified DED metricMetric 4: IND_LOGIND_LOG : Modified IND metricNow from these 4 metrics, we will be deriving the features as follows.

Feature 2: INF ScoreCombine both DED and IND into a single score.

This aggregates the evidence provided by both the ancestors and the descendants of a vertex in order to determine the likelihood with which edges originating from that vertex exist.

Feature 3: INF_2D ScoreThe authors in  noticed that the DED score typically achieves higher precisions than IND.

The combination of DED and IND outperforms DED alone.

They modified INF as INF_2D where DED score is given twice the weight of the IND score.

Feature 4: INF_2I ScoreWe also create INF_2I where IND score is given twice the weight of the DED score.

Feature 5: INF_LOG ScoreA modification of INF in which the amount of satisfying ancestors and descendants, not only their proportion, is taken into account.

Feature 6: INF_LOG_2D ScoreCombining INF_LOG and INF_2D to build INF_LOG_2D, which addresses issues of 1) proportion of satisfying ancestors/descendants 2) low precision in normal INF score.

Feature 7: INF_LOG_2I ScoreWe also create INF_LOG_2I where IND_LOG score is given twice the weight of the DED _LOG score.

Next set of features are Hierarchical Undirected features.

We reformulate classical link predictions features for undirected graph link prediction to be applied in directed graphs.

The authors in defined the specificity level of a vertex in a graph as the number of distinct vertices from which it can be reached, regardless of the path distance.

A sort of recursive descendants set size.

Between two nodes A & B, the node with more number on incoming edges (say A) is said to be more specific than the other node (B); while B is said to be more abstract than A.

The authors of  proposed hierarchical version of the top three undirected link prediction algorithms CN, AA and RA.

They only considers relations going from a given vertex to those which are more abstract than itself.

That is find the ususal similarity for edge x → y if and only if node x is more specific than node y.

If the condition is not met, than set similarity for the edge to be 0.

I extended the list of undirected link prediction algorithms.

I found these in a survey paper by .

First of all, we convert both the directed graphs (train_graph and test_graph) to their undirected equivalent.

Feature 8: Common Neighbors The similarity between two nodes is defined as the number of shared neighbors between both nodes.

Feature 9: Resource Allocation IndexMotivated by the resource allocation process that takes place in complex networks.

Feature 10: Jaccard CoefficientIt measures the ratio of shared neighbors in the complete set of neighbors for two nodes.

This similarity function is defined asFeature 11: Adamic Adar IndexMeasures the similarity between two entities based on their shared features.

However, each feature weight is logarithmically penalized by its appearance frequency.

Feature 12: Preferential AttachmentBased on the observation that the probability of link formation between two nodes increases as the degree of these nodes doesFeature 13: Resource Allocation based on Common Neighbour Interactions (RA-CNI)Motivated by the resource allocation process where each node sends a unit of resource to its neighbors.

Feature 14: Salton Index The Salton index yields a value that is approximately twice the Jaccard index.

Feature 15: Sørensen Index (SO) Despite its similarity with the Jaccard index, it is less sensitive to outliers.

Feature 16: Hub Promoted Index (HPI) The main goal of this similarity measure is to avoid link formation between hub nodes and promote link formation between low-degree nodes and hubs.

Feature 17: Hub Depressed Index (HDI) The hub depressed index promotes link formation between hubs and between low-degree nodes, but not between hubs and low-degree nodes.

Till now, we have created 17 features.

In the next part(s) of this article, we will compute feature importances of the similarities we have found.

We will deal with the severe class imbalance problem that is haunting our train and test datasets.

All this and other things will follow in the upcoming part.