Using NLP to build a search & discovery app for Regulators

Abizer Jafferjee, Feb 7

Regulations need to be updated constantly in this era of rapid socio-economic and technological change.

Regulators spend a substantial amount of time assessing the current stock of Acts to identify inconsistent use of language or markers that don’t support innovation and create a burden for businesses.

Given the large number of Acts and their complex nature, tools that can enable regulators to easily search and compare Acts and discover insights and patterns can be helpful in speeding up the process.

In my last article, I wrote more about some of the ways technologies like natural language processing can help regulators.

On the role of technology in Regulatory Modernization: We need more agile legislation because it is difficult for regulators to cope with fast technological change.


To showcase what these tools could look like, I collaborated with the team at Datadex to build a demo app for Canadian regulators that uses NLP.

The application stores and analyzes 850 Canadian Acts and is designed to enable regulators to easily discover relevant Acts, link Acts by semantic similarities, and identify conflicting or inconsistent language and overlapping rules.

In this article I will show you how each of these tools can be useful for regulators and explain how you can easily build simple versions of them.

The Application

[Figure: The Home Page]

The app contains four main tools. The first is a summaries tool, which contains automatically generated summaries of each of the 850 Acts.

The second is a document graph tool which is an interactive graph that shows how Acts are related to each other.

The third is a context search tool where a regulator can pick words of interest in any context and find references to other Acts with sections that have similar contexts.

The fourth is a word cloud tool which contains many important words from the corpus of Acts and enables regulators to retrieve Acts where the selected word is important.

Summaries Tool

[Figure: How the summaries tool works]

The summaries tool was created to make it easy to browse through a set of Acts and quickly evaluate the subject matter.

Regulators can narrow down the list of Acts they want to see by inputting a keyword related to their subject of interest.

Below we use the keyword “hazard” as an example and find that the “Hazardous Products Act” is a relevant Act.

The summary for this Act states that a Minister can order a supplier to take any measure necessary to remedy non-compliance in relation to a hazardous product.

Now, a regulator can decide whether this is an Act they want to analyze further.

[Figure: Using the summaries tool]

A simple way to generate summaries using NLP is to vectorize the text within the document.

There are multiple ways to vectorize text; one of them is Tf-idf vectorization.

This is a way to convert text into vectors where each word in a document is given a score that reflects how important the word is to that document, relative to the collection of all documents.

The Tf-idf score of a word in a document increases proportionally to the number of times the word appears in that document and is offset by the number of documents in the corpus that contain the word.

Once we’ve got the Tf-idf scores for each word in the document, we can combine them into a score for each sentence, which indicates how important that sentence is to the Act. The highest-scoring sentences can then serve as the summary.
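The sentence-scoring idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the demo app: the function names (`tokenize`, `tfidf_scores`, `summarize`), the particular Tf-idf formula, the choice to average word scores per sentence, and the two toy "Acts" are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(documents):
    """Return one {word: tf-idf} dict per document.

    tf  = count of the word in the document / number of tokens in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(documents)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def summarize(document, word_scores, n_sentences=1):
    """Pick the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def sentence_score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average (assumption) so that long sentences are not favoured automatically.
        return sum(word_scores.get(t, 0.0) for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i]),
                    reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)


# Toy stand-ins for Acts (invented text, not real legislation).
acts = [
    "This Act applies to suppliers. The Minister may order a supplier to remedy a hazardous product.",
    "This Act applies to carriers. The charge is payable for air transportation.",
]
scores = tfidf_scores(acts)
summary = summarize(acts[0], scores[0])
print(summary)
```

Words that appear in every document (here "this", "act", "applies") get a score of zero, so sentences made up mostly of boilerplate wording rank low. A real corpus would likely also need stop-word handling and a more careful sentence splitter.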

Document Graph Tool

[Figure: Interactive node-edge graph of Acts]

The document graph is an interactive node-edge graph which shows how Acts are related to each other.

Often, relations between documents are based on syntactical similarities such as grammatical structure, or on citations within their text.

However, using natural language processing we can calculate the semantic similarity between two documents, so that the relation between them is based on the likeness of their meaning and context.

In the above graph, highly connected Acts are found closer to the center of the graph while Acts with fewer related Acts are found closer to the borders of the graph.

The similarity range filter can be used to set the similarity threshold of nodes displayed on the graph.
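A similarity range filter of this kind could work roughly as follows. The sketch assumes that pairwise similarity scores between Acts have already been computed; the Act names, the scores, and the helper names `filter_edges` and `degree` are hypothetical and only serve the illustration.

```python
def filter_edges(similarities, low, high=1.0):
    """Keep only the Act pairs whose similarity lies in [low, high]."""
    return [(a, b, s) for (a, b), s in similarities.items() if low <= s <= high]


def degree(edges):
    """Count how many edges touch each Act."""
    counts = {}
    for a, b, _ in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


# Hypothetical pairwise similarity scores between four placeholder Acts.
similarities = {
    ("Act A", "Act B"): 0.91,
    ("Act A", "Act C"): 0.62,
    ("Act B", "Act C"): 0.35,
    ("Act C", "Act D"): 0.78,
}

edges = filter_edges(similarities, low=0.6)
connections = degree(edges)
print(edges)
print(connections)
```

Raising `low` thins out the graph so that only strongly related Acts stay linked. The per-Act connection count is one simple signal a graph layout could use to place highly connected Acts near the center.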

The main purpose of this tool was to enable regulators to easily discover relations between Acts that would otherwise be very difficult to identify.

Using semantic comparison, we can discover similarities in context, choice of language or subject matter which would be very difficult to do manually across the entire corpus of Acts.

Once the relations are identified, regulators gain a significant advantage in drawing insights from Acts or creating plans for how to investigate Acts.

For example, regulators can decide how they want to group Acts to evaluate their textual content and features.

They can also use the links to confirm relations between Acts that should exist and to discover links between Acts that are not expected to be related.

[Figure: How the document graph tool works]

In the graph above, we notice that highly interconnected Acts form clusters. For example, the “Canada-Finland Tax Convention Act, 2006” forms a cluster with Canada’s Tax Convention Acts with other countries. Acts on the edge of the graph have few connections to other Acts but are strongly related to just one other Act, indicating possible unique textual characteristics within them.

Earlier we discussed converting text into vector representations.

We can use these vectors to decide whether two documents are related by comparing how close they are in a vector space.

When it comes to comparing text using vectors, what matters is which textual features the vectors capture.

The Tf-idf scores that we used above are good at storing information about the importance of each individual word but don’t tell us anything about the context in which words are used.

Word2Vec is another vectorization method that is useful because it enables us to compare words or sentences by the context they are used in.

Word2Vec models take pairs of target and context words as input, where the target word is the word in focus and the context words are the words found around it.
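To make the target/context idea concrete, here is a small sketch of how such pairs can be generated from a tokenized sentence using a sliding window, in the style of the skip-gram variant of Word2Vec. The article does not say which variant or window size the demo app uses, so `window`, the function name `skipgram_pairs`, and the sample sentence are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Invented sample sentence, already lowercased and split into tokens.
tokens = "the charge is payable for air transportation".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

A model trained on many such pairs learns vectors in which words that keep appearing with the same neighbours end up close together.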

The model is trained on the corpus of Acts and each unique word is assigned a vector.

In a vector space, words that share a common context will be located close to one another, so calculating the distance between the vector representations of Acts within that space can give us an idea of how closely linked two Acts are.
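One simple way to get from word vectors to a comparison between Acts is to average the vectors of the words in each Act and compare the averages with cosine similarity. The article does not state how the demo app builds its Act vectors, so the averaging step is an assumption, and the three-dimensional word vectors below are invented stand-ins for what a trained model would produce.

```python
import math


def average_vector(tokens, word_vectors):
    """Average the vectors of all tokens that the model knows about."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical word vectors; a trained model would supply real ones
# with many more dimensions.
word_vectors = {
    "tax": [0.9, 0.1, 0.0],
    "income": [0.8, 0.2, 0.1],
    "convention": [0.7, 0.3, 0.0],
    "hazardous": [0.0, 0.1, 0.9],
    "product": [0.1, 0.2, 0.8],
}

tax_act_1 = average_vector(["tax", "convention"], word_vectors)
tax_act_2 = average_vector(["income", "tax"], word_vectors)
hazard_act = average_vector(["hazardous", "product"], word_vectors)

sim_tax = cosine_similarity(tax_act_1, tax_act_2)
sim_mixed = cosine_similarity(tax_act_1, hazard_act)
print(round(sim_tax, 3), round(sim_mixed, 3))
```

With these toy vectors the two tax-related documents score much closer to each other than either does to the hazard-related one, which is the kind of signal that can be turned into edges in a document graph.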

Context Search Tool

[Figure: How the context search tool works]

The context search tool is designed to find references to sections of text which are semantically similar to a section selected in any particular Act.

The regulator can simply click on any of the underlined words in the text of an Act and get a list of references to sections from the same Act or other Acts that use the same word, a similar word or a different word in the same context.

In the example above, we view the “Air Travellers Security Charge Act” and select the word “transportation” in the context of chargeable air travel.

The tool retrieves a list of references where the word “transportation” or any other word is used in the same context. For example, the second reference, retrieved from the same Act, uses “transportation” in the context of air travel becoming chargeable after a set date.

In the case above, regulators can confirm whether the word “transportation” is being used appropriately in its context by investigating whether it is being used in similar contexts across other Acts.

If the word is present across similar contexts, regulators can assess whether it is the right word to use, delivers the right effect and will have the right interpretation.

If the same word is not being used to describe similar contexts consistently, then it raises the issue of inconsistent use of language.

Misinterpretation and inconsistency in the use of language in Acts can be consequential for businesses as it can lead to a failure to understand laws and take the right actions.

[Figure: Here’s another example of context search]

Above is another example of the context search tool being used, where selecting the word “terrorism” in the “Justice for Victims of Terrorism Act” returns a reference to the use of “terrorism” in the same context in the “Anti-terrorism Act”.

Word Cloud Tool

[Figure: How the word cloud tool works]

The word cloud was created to make it easy for regulators to find Acts linked to topic words.

In the above example, we select “amendments” and get a list of all Acts which are relevant to that topic.

Topic words are generated from the entire corpus of Acts and reflect the most important or relevant words being used within the text.

Each topic word is then linked to each Act within the corpus where it is important and relevant to the subject-matter.

The advantage for regulators in discovering Acts in this way is that they don’t need to have the names of all the Acts related to a topic beforehand.

Also, if they don’t have a topic in mind, they can explore the range of topics in the word cloud and choose Acts by topic word.

Furthermore, topic words are sized by their importance or the number of Acts they are important in.

This can help regulators in their analysis of how different words are being used in the corpus of Acts and answer questions like why some words are being overused, underused or used inappropriately in linked Acts.

Tf-idf transformations from earlier can be used to identify the most important topic words within an Act.

Recall that the Tf-idf score given to each word is an indication of the importance of that word in the Act.

We can set a threshold for the score and then select any word which has a score greater than the threshold to be a topic word.

Once a topic word is identified in one Act, we can search all other Acts to see whether that word is an important word in other Acts and link those Acts to the selected topic word.
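The thresholding and linking steps can be sketched as a small inverted index from topic words to Acts. The Tf-idf scores, the Act names, the threshold of 0.3, and the function name `topic_index` are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


def topic_index(act_scores, threshold):
    """Map each topic word to the Acts where its score exceeds the threshold."""
    index = defaultdict(list)
    for act, scores in act_scores.items():
        for word, score in scores.items():
            if score > threshold:
                index[word].append(act)
    return dict(index)


# Hypothetical Tf-idf scores for a few words in three placeholder Acts.
act_scores = {
    "Act A": {"amendments": 0.42, "minister": 0.05, "tax": 0.31},
    "Act B": {"amendments": 0.38, "hazard": 0.51},
    "Act C": {"tax": 0.44, "amendments": 0.08},
}

index = topic_index(act_scores, threshold=0.3)

# One possible way to size words in the cloud: by the number of linked Acts.
word_sizes = {word: len(acts) for word, acts in index.items()}
print(index)
print(word_sizes)
```

Selecting a word in the cloud then amounts to a lookup such as `index["amendments"]`, which here returns the two Acts where that word scores above the threshold.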

The above application shows the benefit of tools that can understand and compare text at a semantic level for enabling regulators to be more effective in searching and analyzing Acts.

This set of tools is only a glimpse of what can be achieved with technologies like NLP.

Soon, advanced tools that enable regulators to do other things, like detect future trends in regulatory compliance, pool and analyze text from Acts with data from external sources such as public commentary on Acts, or provide recommendations for Acts that need changes, will be commonplace in the regulator’s toolkit.
