Knowledge Graph – A Powerful Data Science Technique to Mine Information from Text (with Python code)

Entities Extraction

The extraction of a single word entity from a sentence is not a tough task.

We can easily do this with the help of parts of speech (POS) tags.

The nouns and the proper nouns would be our entities.
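As a rough illustration of the single-word case, the sketch below keeps only the words tagged as nouns or proper nouns. The (word, tag) pairs are hand-assigned stand-ins for what a POS tagger would return, and `single_word_entities` is a hypothetical helper name, not something taken from the article's code.

```python
# Hand-assigned (word, POS tag) pairs; a real pipeline would take these
# from a tagger. The tag names follow the Universal POS tag set.
TAGGED_WORDS = [
    ("Nagal", "PROPN"),
    ("won", "VERB"),
    ("the", "DET"),
    ("first", "ADJ"),
    ("set", "NOUN"),
]

ENTITY_TAGS = {"NOUN", "PROPN"}


def single_word_entities(tagged_words):
    """Return the words whose POS tag marks them as a noun or proper noun."""
    return [word for word, pos in tagged_words if pos in ENTITY_TAGS]


print(single_word_entities(TAGGED_WORDS))
```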

However, when an entity spans across multiple words, then POS tags alone are not sufficient.

We need to parse the dependency tree of the sentence.

You can read more about dependency parsing in this article.

Let’s get the dependency tags for one of the shortlisted sentences.

I will use the popular spaCy library for this task: View the code on Gist.

Output:

The … det
22-year … amod
– … punct
old … nsubj
recently … advmod
won … ROOT
ATP … compound
Challenger … compound
tournament … dobj
. … punct

The subject (nsubj) in this sentence, as per the dependency parser, is “old”.

That is not the desired entity.

We wanted to extract “22-year-old” instead.

The dependency tag of “22-year” is amod which means it is a modifier of “old”.

Hence, we should define a rule to extract such entities.

The rule can be something like this — extract the subject/object along with its modifiers and also extract the punctuation marks between them.

But then look at the object (dobj) in the sentence.

It is just “tournament” instead of “ATP Challenger tournament”.

Here, we don’t have the modifiers but compound words.

Compound words are those words that collectively form a new term with a different meaning.

Therefore, we can update the above rule to: extract the subject/object along with its modifiers and compound words, and also extract the punctuation marks between them.

In short, we will use dependency parsing to extract entities.
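To make the rule concrete, here is a minimal sketch that applies it to the tagged tokens from the output above. It works on plain (token, tag) pairs so that it runs without a parser. The helper name `expand_entities` and the choice to join punctuation without spaces are assumptions made for this illustration.

```python
# (token, dependency tag) pairs copied from the parser output shown above.
TAGGED_TOKENS = [
    ("The", "det"), ("22-year", "amod"), ("–", "punct"), ("old", "nsubj"),
    ("recently", "advmod"), ("won", "ROOT"), ("ATP", "compound"),
    ("Challenger", "compound"), ("tournament", "dobj"), (".", "punct"),
]

ATTACHED_TAGS = {"amod", "compound"}  # words that belong to the next entity


def expand_entities(tagged):
    """Collect each subject/object together with its modifiers and compounds."""
    entities = []
    phrase = ""    # modifiers/compounds gathered so far
    glued = False  # True right after punctuation, so no space is inserted
    for text, dep in tagged:
        if dep in ATTACHED_TAGS:
            phrase += text if (not phrase or glued) else " " + text
            glued = False
        elif dep == "punct" and phrase:
            # Punctuation between a modifier and its head is kept as glue.
            phrase += text
            glued = True
        elif "subj" in dep or "obj" in dep:
            head = text if (not phrase or glued) else " " + text
            entities.append((dep, phrase + head))
            phrase, glued = "", False
        else:
            # Any other word breaks the run of modifiers.
            phrase, glued = "", False
    return entities


for dep, entity in expand_entities(TAGGED_TOKENS):
    print(dep, "->", entity)
```

With this rule the subject comes out as the full “22-year–old” phrase and the object as “ATP Challenger tournament”, instead of the single words “old” and “tournament”.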

Extract Relations

Entity extraction is half the job done.

To build a knowledge graph, we need edges to connect the nodes (entities) to one another.

These edges are the relations between a pair of nodes.

Let’s go back to the example in the last section.

We shortlisted a couple of sentences to build a knowledge graph:

Can you guess the relation between the subject and the object in these two sentences?

Both sentences have the same relation – “won”.

Let’s see how these relations can be extracted.

We will again use dependency parsing: View the code on Gist.

Output:

Nagal … nsubj
won … ROOT
the … det
first … amod
set … dobj
. … punct

To extract the relation, we have to find the ROOT of the sentence (which is also the verb of the sentence).

Hence, the relation extracted from this sentence would be “won”.
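A minimal sketch of this lookup, again on plain (token, tag) pairs copied from the output above rather than on a live parse; `root_relation` is a hypothetical helper name.

```python
# (token, dependency tag) pairs copied from the parser output shown above.
NAGAL_TOKENS = [
    ("Nagal", "nsubj"), ("won", "ROOT"), ("the", "det"),
    ("first", "amod"), ("set", "dobj"), (".", "punct"),
]


def root_relation(tagged):
    """Return the token tagged ROOT, or None if the sentence has none."""
    for text, dep in tagged:
        if dep == "ROOT":
            return text
    return None


print(root_relation(NAGAL_TOKENS))
```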

Finally, the knowledge graph from these two sentences will be like this:

Build a Knowledge Graph from Text Data

Time to get our hands on some code!

Let’s fire up our Jupyter Notebooks (or whatever IDE you prefer).

We will build a knowledge graph from scratch by using the text from a set of Wikipedia articles related to movies and films.

I have already extracted around 4,300 sentences from over 500 Wikipedia articles.

Each of these sentences contains exactly two entities – one subject and one object.

You can download these sentences from here.

I suggest using Google Colab for this implementation to speed up the computation time.

Import Libraries

View the code on Gist.

Read Data

Read the CSV file containing the Wikipedia sentences: View the code on Gist.

Output: (4318, 1)

Let’s inspect a few sample sentences:

candidate_sentences['sentence'].sample(5)

Output:

Let’s check the subject and object of one of these sentences.

Ideally, there should be one subject and one object in the sentence: View the code on Gist.

Output:

Perfect!

There is only one subject (‘process’) and only one object (‘standard’).

You can check for other sentences in a similar manner.

Entity Pairs Extraction

To build a knowledge graph, the most important things are the nodes and the edges between them.

These nodes are going to be the entities that are present in the Wikipedia sentences.

Edges are the relationships connecting these entities to one another.

We will extract these elements in an unsupervised manner, i.e., we will use the grammar of the sentences.

The main idea is to go through a sentence and extract the subject and the object as and when they are encountered.

However, there are a few challenges: an entity can span multiple words, e.g., “red wine”, and the dependency parsers tag only the individual words as subjects or objects.

So, I have created a function below to extract the subject and the object (entities) from a sentence while also overcoming the challenges mentioned above.

I have partitioned the code into multiple chunks for your convenience: View the code on Gist.

Let me explain the code chunks in the function above:

Chunk 1

I have defined a few empty variables in this chunk.

prv_tok_dep and prv_tok_text will hold the dependency tag of the previous word in the sentence and that previous word itself, respectively.

prefix and modifier will hold the text that is associated with the subject or the object.

Chunk 2

Next, we will loop through the tokens in the sentence.

We will first check if the token is a punctuation mark or not.

If yes, then we will ignore it and move on to the next token.

If the token is a part of a compound word (dependency tag = “compound”), we will keep it in the prefix variable.

A compound word is a combination of multiple words linked to form a word with a new meaning (example – “Football Stadium”, “animal lover”).

As and when we come across a subject or an object in the sentence, we will add this prefix to it.

We will do the same thing with the modifier words, such as “nice shirt”, “big house”, etc.

Chunk 3

Here, if the token is the subject, then it will be captured as the first entity in the ent1 variable.

Variables such as prefix, modifier, prv_tok_dep, and prv_tok_text will be reset.

Chunk 4

Here, if the token is the object, then it will be captured as the second entity in the ent2 variable.

Variables such as prefix, modifier, prv_tok_dep, and prv_tok_text will again be reset.

Chunk 5

Once we have captured the subject and the object in the sentence, we will update the previous token and its dependency tag.
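The five chunks can be sketched roughly as follows. This is a reconstruction from the descriptions above, not the code from the Gist. Several things are assumptions: the name `get_entities_sketch`, the `Token` stand-in (spaCy tokens expose the same `text` and `dep_` attributes), the hand-assigned tags in the example, and the extra reset that clears a pending prefix or modifier when an unrelated word interrupts the run. The sketch tracks only the previous dependency tag, because the running prefix already holds the previous compound word.

```python
from collections import namedtuple

# Stand-in for a parsed token; only .text and .dep_ are used below.
Token = namedtuple("Token", ["text", "dep_"])


def get_entities_sketch(tokens):
    # Chunk 1: start with empty variables.
    ent1 = ent2 = ""
    prv_tok_dep = ""
    prefix = modifier = ""

    for tok in tokens:
        # Chunk 2: ignore punctuation, remember compounds and modifiers.
        if tok.dep_ == "punct":
            continue
        is_compound = tok.dep_ == "compound"
        is_modifier = tok.dep_.endswith("mod")
        if is_compound:
            # Chain consecutive compound words, e.g. "ATP Challenger".
            prefix = f"{prefix} {tok.text}" if prv_tok_dep == "compound" else tok.text
        if is_modifier:
            modifier = tok.text

        # Chunk 3: the subject becomes the first entity.
        if "subj" in tok.dep_:
            ent1 = " ".join(p for p in (modifier, prefix, tok.text) if p)
        # Chunk 4: the object becomes the second entity.
        if "obj" in tok.dep_:
            ent2 = " ".join(p for p in (modifier, prefix, tok.text) if p)

        # Reset after a capture, and (an added assumption) whenever any
        # other word interrupts a run of compounds/modifiers.
        if not (is_compound or is_modifier):
            prefix = modifier = ""

        # Chunk 5: remember the dependency tag of this token.
        prv_tok_dep = tok.dep_

    return [ent1, ent2]


# Hand-assigned tags for "the film had 200 patents".
film_tokens = [
    Token("the", "det"), Token("film", "nsubj"), Token("had", "ROOT"),
    Token("200", "nummod"), Token("patents", "dobj"),
]
print(get_entities_sketch(film_tokens))
```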

Let’s test this function on a sentence:

get_entities("the film had 200 patents")

Output: ['film', '200 patents']

Great, it seems to be working as planned.

In the above sentence, ‘film’ is the subject and ‘200 patents’ is the object.

Now we can use this function to extract these entity pairs for all the sentences in our data: View the code on Gist.

The list entity_pairs contains all the subject-object pairs from the Wikipedia sentences.

Let’s have a look at a few of them:

entity_pairs[10:20]

Output:

As you can see, there are a few pronouns in these entity pairs such as ‘we’, ‘it’, ‘she’, etc.

We’d like to have proper nouns or nouns instead.

Perhaps we can further improve the get_entities( ) function to filter out pronouns.

For the time being, let’s leave it as it is and move on to the relation extraction part.
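If you want to try the pronoun filter later, one simple option is to drop the pairs after extraction. The sketch below assumes a small hand-written pronoun list and made-up sample pairs. `drop_pronoun_pairs` is a hypothetical helper and not part of the article's code.

```python
# A small, deliberately incomplete pronoun list (an assumption for this sketch).
PRONOUNS = {"i", "we", "you", "he", "she", "it", "they", "who", "which", "that"}


def drop_pronoun_pairs(entity_pairs):
    """Keep only the pairs in which neither entity is a bare pronoun."""
    return [
        pair for pair in entity_pairs
        if not any(entity.lower() in PRONOUNS for entity in pair)
    ]


# Made-up sample pairs in the same [subject, object] shape as entity_pairs.
sample_pairs = [["we", "film"], ["film", "200 patents"], ["It", "award"]]
print(drop_pronoun_pairs(sample_pairs))
```

When the tokens come from spaCy, checking the part-of-speech tag of the subject or object token for `PRON` inside the extraction function would likely be more robust than a word list.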

Relation / Predicate Extraction

This is going to be a very interesting aspect of this article.

Our hypothesis is that the predicate is actually the main verb in a sentence.

For example, in the sentence – “Sixty Hollywood musicals were released in 1929”, the verb is “released in” and this is what we are going to use as the predicate for the triple generated from this sentence.

The function below is capable of capturing such predicates from the sentences.

Here, I have used spaCy’s rule-based matching: View the code on Gist.

The pattern defined in the function tries to find the ROOT word or the main verb in the sentence.

Once the ROOT is identified, then the pattern checks whether it is followed by a preposition (‘prep’) or an agent word.

If yes, then it is added to the ROOT word.
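The same logic can be sketched in plain Python on pre-tagged tokens. This is a stand-in for the rule-based matcher, not the matcher itself: `relation_from_tags` is a hypothetical helper, and the dependency tags in both examples are hand-assigned assumptions about what a parser would return.

```python
FOLLOWER_TAGS = {"prep", "agent"}  # tags that may extend the ROOT word


def relation_from_tags(tagged):
    """Return the ROOT word, plus the next token if it is a prep/agent."""
    for i, (text, dep) in enumerate(tagged):
        if dep == "ROOT":
            relation = [text]
            if i + 1 < len(tagged) and tagged[i + 1][1] in FOLLOWER_TAGS:
                relation.append(tagged[i + 1][0])
            return " ".join(relation)
    return ""  # no ROOT found


# Hand-assigned tags for the two example sentences.
musicals = [
    ("Sixty", "nummod"), ("Hollywood", "compound"), ("musicals", "nsubjpass"),
    ("were", "auxpass"), ("released", "ROOT"), ("in", "prep"), ("1929", "pobj"),
]
john = [("John", "nsubj"), ("completed", "ROOT"), ("the", "det"), ("task", "dobj")]

print(relation_from_tags(musicals))
print(relation_from_tags(john))
```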

Let me show you a glimpse of this function:

get_relation("John completed the task")

Output: completed

Similarly, let’s get the relations from all the Wikipedia sentences:

relations = [get_relation(i) for i in tqdm(candidate_sentences['sentence'])]

Let’s take a look at the most frequent relations or predicates that we have just extracted:

pd.Series(relations).value_counts()[:50]

Output:

It turns out that relations like “A is B” and “A was B” are the most common relations.

However, there are quite a few relations that are more associated with the overall theme – “the ecosystem around movies”.

Some of the examples are “composed by”, “released in”, “produced”, “written by” and a few more.

Build a Knowledge Graph

We will finally create a knowledge graph from the extracted entities (subject-object pairs) and the predicates (relation between entities).

Let’s create a dataframe of entities and predicates: View the code on Gist.

Next, we will use the networkx library to create a network from this dataframe.

The nodes will represent the entities and the edges or connections between the nodes will represent the relations between the nodes.

It is going to be a directed graph.

In other words, the relation between any connected node pair is not two-way; it runs only from one node to the other.

For example, in “John eats pasta”, the edge points from “John” to “pasta” and not the other way around: View the code on Gist.
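To show what “directed” means here, the sketch below stores the triple in a plain dictionary instead of a networkx graph. It is a simplified stand-in for illustration only, and `build_directed_graph` is a hypothetical helper name.

```python
from collections import defaultdict


def build_directed_graph(triples):
    """Map each source node to a list of (target, relation) edges."""
    graph = defaultdict(list)
    for source, relation, target in triples:
        graph[source].append((target, relation))
    return graph


graph = build_directed_graph([("John", "eats", "pasta")])
print(graph["John"])     # edges leaving "John"
print("pasta" in graph)  # "pasta" has no outgoing edges
```

The edge is stored only under its source node, so following it from “pasta” back to “John” is not possible.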

Let’s plot the network: View the code on Gist.

Output:

Well, this is not exactly what we were hoping for (still looks quite a sight though!).

It turns out that we have created a graph with all the relations that we had.

It becomes really hard to visualize a graph with this many relations or predicates.

So, it’s advisable to use only a few important relations to visualize a graph.

I will take one relation at a time.

Let’s start with the relation “composed by”: View the code on Gist.

Output:

That’s a much cleaner graph.

Here the arrows point towards the composers.

For instance, A.R. Rahman, who is a renowned music composer, has entities like “soundtrack score”, “film score”, and “music” connected to him in the graph above.
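Selecting a single relation comes down to filtering the triples before building the graph. The sketch below does this on a handful of made-up triples, loosely modelled on the entities named above. `edges_for_relation` is a hypothetical helper, and counting incoming edges is just one way to see where the arrows point.

```python
from collections import Counter

# Made-up (source, relation, target) triples for illustration only.
SAMPLE_TRIPLES = [
    ("soundtrack score", "composed by", "Rahman"),
    ("film score", "composed by", "Rahman"),
    ("music", "composed by", "Rahman"),
    ("film", "released in", "1980s"),
]


def edges_for_relation(triples, relation):
    """Return the (source, target) edges that carry the given relation."""
    return [(source, target) for source, rel, target in triples if rel == relation]


composed_by = edges_for_relation(SAMPLE_TRIPLES, "composed by")
incoming = Counter(target for _, target in composed_by)

print(composed_by)
print(incoming.most_common(1))  # the node most arrows point towards
```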

Let’s check out a few more relations.

Since writing is an important role in any movie, I would like to visualize the graph for the “written by” relation: View the code on Gist.

Output:

Awesome!

This knowledge graph is giving us some extraordinary information.

Guys like Javed Akhtar, Krishna Chaitanya, and Jaideep Sahni are all famous lyricists and this graph beautifully captures this relationship.

Let’s see the knowledge graph of another important predicate, i.e., “released in”: View the code on Gist.

Output:

I can see quite a bit of interesting information in this graph.

For example, look at this relationship – “several action horror movies released in the 1980s” and “pk released on 4844 screens”.

These are facts, and they show us that we can mine such facts from just text.

That’s quite amazing!

End Notes

In this article, we learned how to extract information from a given text in the form of triples and build a knowledge graph from it.

However, we restricted ourselves to using sentences with exactly 2 entities.

Even then we were able to build quite informative knowledge graphs.

Imagine the potential we have here!

I encourage you to explore this field of information extraction more to learn extraction of more complex relationships.

In case you have any doubt or you want to share your thoughts, please feel free to use the comments section below.
