How Search Engines like Google Retrieve Results: Introduction to Information Extraction using Python and spaCy
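The snippets in this article assume that spaCy, its rule-based Matcher, and the displacy visualizer have already been imported and that an English model is loaded. A minimal setup sketch is shown below, assuming the en_core_web_sm model (any English pipeline with a dependency parser would work):

import spacy
from spacy import displacy
from spacy.matcher import Matcher

# load a small English pipeline (assumed; any parser-enabled English model works)
nlp = spacy.load("en_core_web_sm")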

Pattern: X, including Y

doc = nlp("Eight people, including two children, were injured in the explosion")

for tok in doc:
    print(tok.text, "-->", tok.dep_, "-->", tok.pos_)

Output:

Eight --> nummod --> NUM
people --> nsubjpass --> NOUN
, --> punct --> PUNCT
including --> prep --> VERB
two --> nummod --> NUM
children --> pobj --> NOUN
, --> punct --> PUNCT
were --> auxpass --> VERB
injured --> ROOT --> VERB
in --> prep --> ADP
the --> det --> DET
explosion --> pobj --> NOUN

# Matcher class object
matcher = Matcher(nlp.vocab)

# define the pattern
pattern = [{'DEP': 'nummod', 'OP': '?'},  # numeric modifier
           {'DEP': 'amod', 'OP': '?'},    # adjectival modifier
           {'POS': 'NOUN'},
           {'IS_PUNCT': True},
           {'LOWER': 'including'},
           {'DEP': 'nummod', 'OP': '?'},
           {'DEP': 'amod', 'OP': '?'},
           {'POS': 'NOUN'}]

matcher.add("matching_1", None, pattern)

matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]
span.text

Output: 'Eight people, including two children'

Pattern: X, especially Y

doc = nlp("A healthy eating pattern includes fruits, especially whole fruits.")

for tok in doc:
    print(tok.text, "-->", tok.dep_, "-->", tok.pos_)

Output:

A --> det --> DET
healthy --> amod --> ADJ
eating --> compound --> NOUN
pattern --> nsubj --> NOUN
includes --> ROOT --> VERB
fruits --> dobj --> NOUN
, --> punct --> PUNCT
especially --> advmod --> ADV
whole --> amod --> ADJ
fruits --> appos --> NOUN
. --> punct --> PUNCT

# Matcher class object
matcher = Matcher(nlp.vocab)

# define the pattern
pattern = [{'DEP': 'nummod', 'OP': '?'},
           {'DEP': 'amod', 'OP': '?'},
           {'POS': 'NOUN'},
           {'IS_PUNCT': True},
           {'LOWER': 'especially'},
           {'DEP': 'nummod', 'OP': '?'},
           {'DEP': 'amod', 'OP': '?'},
           {'POS': 'NOUN'}]

matcher.add("matching_1", None, pattern)

matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]
span.text

Output: 'fruits, especially whole fruits'
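The two patterns above differ only in the trigger word ("including" vs. "especially"). As an illustration (this helper is my own addition, not part of the original article), the pattern can be parameterized so the same rule covers any single-token trigger word:

# Hypothetical helper: build the "X, <trigger> Y" pattern for a given trigger word
def build_trigger_pattern(trigger):
    return [{'DEP': 'nummod', 'OP': '?'},  # optional numeric modifier
            {'DEP': 'amod', 'OP': '?'},    # optional adjectival modifier
            {'POS': 'NOUN'},
            {'IS_PUNCT': True},
            {'LOWER': trigger},            # e.g. "including" or "especially"
            {'DEP': 'nummod', 'OP': '?'},
            {'DEP': 'amod', 'OP': '?'},
            {'POS': 'NOUN'}]

matcher = Matcher(nlp.vocab)
matcher.add("matching_1", None, build_trigger_pattern("including"))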

2. Subtree Matching for Relation Extraction

The simple rule-based methods work well for information extraction tasks.

However, they have a few shortcomings.

We have to be extremely creative to come up with new rules to capture different patterns.

It is difficult to build patterns that generalize well across different sentences.

To enhance the rule-based methods for relation/information extraction, we should try to understand the dependency structure of the sentences at hand.

Let's take a sample text and build its dependency tree:

text = "Tableau was recently acquired by Salesforce."

# Plot the dependency graph
doc = nlp(text)
displacy.render(doc, style='dep', jupyter=True)

Output: (rendered dependency graph of the sentence)

Can you find any interesting relation in this sentence? If you look at the entities in the sentence – Tableau and Salesforce – they are related by the term 'acquired'.
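If you are not working inside a Jupyter notebook, the inline rendering above will not appear; displacy.serve() can open the visualization in a browser instead. As a rough plain-text alternative (my own illustration, not code from the original article), you can print each token together with its dependency label and head to inspect the same tree:

# Plain-text view of the dependency tree: token, its dependency label, and its head
for tok in doc:
    print(f"{tok.text:<12} {tok.dep_:<10} head: {tok.head.text}")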

So, the pattern I can extract from this sentence is either “Salesforce acquired Tableau” or “X acquired Y”.

Now consider this statement: "Careem, a ride-hailing major in the Middle East, was acquired by Uber." Its dependency graph will look something like this:

(rendered dependency graph of the Careem sentence)

Pretty scary, right? Don't worry! All we have to check is which dependency paths are common between multiple sentences. This method is known as subtree matching.

For instance, if we compare this statement with the previous one, we will just consider the common dependency paths and extract the entities and the relation (acquired) between them.
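To see this overlap concretely, the snippet below (an illustrative addition of mine, not from the original article) prints the dependency-tag sequence of both sentences; the shared backbone around "acquired" (nsubjpass, auxpass, ROOT, agent, pobj) is the common subtree we care about:

# Print the dependency tags of both sentences to spot the shared path around "acquired"
sent_1 = nlp("Tableau was recently acquired by Salesforce.")
sent_2 = nlp("Careem, a ride-hailing major in the Middle East, was acquired by Uber.")

print([tok.dep_ for tok in sent_1])
print([tok.dep_ for tok in sent_2])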

Hence, the relations extracted from these sentences are:

Salesforce acquired Tableau
Uber acquired Careem

Let's try to implement this technique in Python.

We will again use spaCy as it makes it pretty easy to traverse a dependency tree.

We will start by taking a look at the dependency tags and POS tags of the words in the sentence:

text = "Tableau was recently acquired by Salesforce."
doc = nlp(text)

for tok in doc:
    print(tok.text, "-->", tok.dep_, "-->", tok.pos_)

Output:

Tableau --> nsubjpass --> PROPN
was --> auxpass --> VERB
recently --> advmod --> ADV
acquired --> ROOT --> VERB
by --> agent --> ADP
Salesforce --> pobj --> PROPN
. --> punct --> PUNCT

Here, the dependency tag for "Tableau" is nsubjpass, which stands for a passive subject (as it is a passive sentence). The other entity, "Salesforce", is the object of this sentence, and the term "acquired" is the ROOT of the sentence, which means it connects the object and the subject.
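To make that connection concrete, here is a small sketch of walking the tree from the ROOT (my own illustration, assuming the parse shown above, not code from the original article): the nsubjpass child of "acquired" is the entity being acquired, while its agent child ("by") leads to the acquirer.

# Walk the tree from the ROOT: its children include the passive subject and the agent phrase
root = [tok for tok in doc if tok.dep_ == "ROOT"][0]   # "acquired"

for child in root.children:
    if child.dep_ == "nsubjpass":          # entity being acquired
        print("acquired entity:", child.text)
    if child.dep_ == "agent":              # the "by" phrase
        for grandchild in child.children:
            if grandchild.dep_ == "pobj":  # the acquirer
                print("acquirer:", grandchild.text)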

Let's define a function to perform subtree matching:

def subtree_matcher(doc):
    x = ''
    y = ''

    # iterate through all the tokens in the input sentence
    for i, tok in enumerate(doc):
        # extract the subject
        if "subjpass" in tok.dep_:
            y = tok.text

        # extract the object
        if tok.dep_.endswith("obj"):
            x = tok.text

    return x, y

In this case, we just have to find all those sentences that have two entities and have the term "acquired" as the only ROOT. We can then capture the subject and the object from those sentences.
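Note that subtree_matcher() itself does not check the ROOT; it simply grabs any passive subject and any object it finds. Here is a hedged sketch of how one might enforce the "acquired as ROOT" condition before extracting (the helper name and structure are my own, not from the article):

# Hypothetical filter: only run the extractor on sentences whose ROOT is a form of "acquire"
def extract_acquisitions(texts):
    pairs = []
    for text in texts:
        doc = nlp(text)
        root = [tok for tok in doc if tok.dep_ == "ROOT"][0]
        if root.lemma_ == "acquire":
            pairs.append(subtree_matcher(doc))
    return pairs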

Let's call the function on our document:

subtree_matcher(doc)

Output: ('Salesforce', 'Tableau')

Here, the first element of the output is the acquirer and the second is the entity that is getting acquired.

Let's use the same function, subtree_matcher(), to extract entities related by the same relation ("acquired"):

text_2 = "Careem, a ride hailing major in middle east, was acquired by Uber."
doc_2 = nlp(text_2)

subtree_matcher(doc_2)

Output: ('Uber', 'Careem')

Did you see what happened here? This sentence had more words and punctuation marks, but our logic still worked and successfully extracted the related entities.

But wait – what if I change the sentence from passive to active voice? Will our logic still work?

text_3 = "Salesforce recently acquired Tableau."
doc_3 = nlp(text_3)

subtree_matcher(doc_3)

Output: ('Tableau', '')

That's not quite what we expected.

The function has failed to capture ‘Salesforce’ and wrongly returned ‘Tableau’ as the acquirer.

So, what could go wrong? Let's look at the dependency tree of this sentence:

for tok in doc_3:
    print(tok.text, "-->", tok.dep_, "-->", tok.pos_)

Output:

Salesforce --> nsubj --> PROPN
recently --> advmod --> ADV
acquired --> ROOT --> VERB
Tableau --> dobj --> PROPN
. --> punct --> PUNCT

It turns out that the grammatical roles of 'Salesforce' and 'Tableau' have been interchanged in the active voice: 'Salesforce' is now the subject and 'Tableau' the object. More importantly, the dependency tag for the subject has changed from 'nsubjpass' to 'nsubj', so the tag itself tells us whether the sentence is in the active or the passive voice. We can use this property to modify our subtree matching function.

Given below is the new function for subtree matching:

def new_subtree_matcher(doc):
    subjpass = 0

    for i, tok in enumerate(doc):
        # find a dependency tag that contains the text "subjpass"
        if "subjpass" in tok.dep_:
            subjpass = 1

    x = ''
    y = ''

    # if subjpass == 1, the sentence is passive
    if subjpass == 1:
        for i, tok in enumerate(doc):
            if "subjpass" in tok.dep_:
                y = tok.text

            if tok.dep_.endswith("obj"):
                x = tok.text

    # if subjpass == 0, the sentence is active
    else:
        for i, tok in enumerate(doc):
            if tok.dep_.endswith("subj"):
                x = tok.text

            if tok.dep_.endswith("obj"):
                y = tok.text

    return x, y

Let's try this new function on the active voice sentence:

new_subtree_matcher(doc_3)

Output: ('Salesforce', 'Tableau')

Great! The output is correct.

Let's pass the previous passive sentence to this function:

new_subtree_matcher(nlp("Tableau was recently acquired by Salesforce."))

Output: ('Salesforce', 'Tableau')

That's exactly what we were looking for.

We have made the function slightly more general.
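For example, the generalized function can now be run over a small batch of sentences in either voice to produce (acquirer, relation, acquired) triples. The snippet below is an illustrative sketch along those lines (the sample list and the printed triple format are my own, not from the article):

# Build (acquirer, "acquired", acquired) triples from active and passive sentences
sentences = [
    "Tableau was recently acquired by Salesforce.",
    "Salesforce recently acquired Tableau.",
    "Careem, a ride hailing major in middle east, was acquired by Uber.",
]

for sent in sentences:
    x, y = new_subtree_matcher(nlp(sent))
    print((x, "acquired", y))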

I would urge you to deep dive into the grammatical structure of different types of sentences and try to make this function more flexible.

End Notes

In this article, we learned about Information Extraction, the concept of relations and triples, and different methods for relation extraction.

Personally, I really enjoyed doing research on this topic and am planning to write a few more articles on more advanced methods for information extraction.

Although we have covered a lot of ground, we have just scratched the surface of the field of Information Extraction.

The next step is to use the techniques learned in this article on a real-world text dataset and see how effective these methods are.

If you're new to the world of NLP, I highly recommend taking our popular course on the subject: Natural Language Processing (NLP) using Python.