Making the Mueller Report Searchable with OCR and Elasticsearch

The result is a list of dictionaries (JSON), which is perfect for ingestion by elastic to make it searchable.

Index in Elasticsearch

The first thing you need to do is make sure elastic is running on the proper port.

Open a terminal and start elastic (if it’s in your $PATH it should just be elasticsearch).

By default, this will start the service on port 9200.
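If you want to confirm the service is actually listening before going further, a quick check like the one below can help. This is only a sketch: the helper names is_elastic_up and looks_like_elastic are made up for illustration, and it assumes an unsecured local instance answering plain HTTP on port 9200 (newer releases may ship with HTTPS and authentication enabled, in which case this will simply report False).

```python
import http.client
import json
import urllib.request


def looks_like_elastic(payload):
    """Return True if a decoded root response carries a version number."""
    version = payload.get("version") if isinstance(payload, dict) else None
    return isinstance(version, dict) and "number" in version


def is_elastic_up(url="http://localhost:9200", timeout=2.0):
    """Return True if something that looks like Elasticsearch answers at url.

    The URL and port are assumptions based on the default local setup.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return looks_like_elastic(json.load(resp))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeouts, HTTP errors, and non-JSON replies
        # are all treated as "not up".
        return False


if __name__ == "__main__":
    print("elastic reachable:", is_elastic_up())
```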

After that, we can easily use the Python client to interact with our instance.

If elastic is running properly on port 9200, the following code should create the index mueller-report which has 2 fields: text and page (these correspond to our dictionary keys in the previous function).
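A minimal sketch of what that indexing step might look like is shown here. It assumes the official elasticsearch Python package is installed, that pages is the list of dictionaries with text and page keys, and that the cluster accepts documents without an explicit type (true for recent versions, older ones may need a _type field). The function names build_actions and index_pages are hypothetical.

```python
def build_actions(pages, index_name="mueller-report"):
    """Yield one bulk-index action per page dictionary."""
    for doc in pages:
        yield {
            "_index": index_name,
            "_source": {"text": doc["text"], "page": doc["page"]},
        }


def index_pages(pages, host="http://localhost:9200", index_name="mueller-report"):
    """Create the index if it is missing, then bulk-load every page into it."""
    # Imported here so the helper above stays usable without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(host)
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name)
    # helpers.bulk sends the actions in batches instead of one request per page.
    return helpers.bulk(es, build_actions(pages, index_name))


if __name__ == "__main__":
    # Tiny made-up sample; the real input would be the OCR output.
    sample = [{"text": "example page text", "page": 1}]
    print(list(build_actions(sample)))
```

Letting elastic infer the field types keeps the sketch short; an explicit mapping is an option if you want tighter control over how text is analyzed.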

Searching our Index

I won’t get into the specifics, but elastic uses a language called Query DSL to interact with the indices.

There’s a lot you can do with it, but all we’re going to do here is create a function that will vectorize our query, and compare it with the text in our index for similarity.

The response, res, will be a JSON object that contains a bunch of info on our search.
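Here is one way such a search function could be written. It is a sketch, not necessarily the exact query used originally: it assumes a plain match query against the text field, and the names build_match_query and search are illustrative. The es argument is expected to be an Elasticsearch client instance; newer client versions may prefer keyword arguments over body.

```python
def build_match_query(query_text, field="text", size=10):
    """Build a Query DSL body that full-text matches query_text on one field."""
    return {"query": {"match": {field: query_text}}, "size": size}


def search(es, query_text, index_name="mueller-report", size=10):
    """Run the match query and return the raw response (the 'res' object)."""
    body = build_match_query(query_text, size=size)
    return es.search(index=index_name, body=body)


if __name__ == "__main__":
    # Show the request body without needing a running cluster.
    print(build_match_query("department of justice"))
```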

Realistically though, we only want our relevant results.

So once we actually call the function, we can parse the json to get the most relevant text and page number.

With this, our search function looks for “department of justice” within the page texts, and returns the results.

The [0] in the statement above is just to look at the first, most relevant page text and number.

However, you can customize the parsing so that it returns as few/many results as you like.
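As a rough illustration of that parsing, the helper below pulls the top n hits out of a response. It assumes res has the usual Elasticsearch shape, with matches under hits.hits sorted by relevance and the indexed fields under _source; the name top_results and the sample response are made up for the example.

```python
def top_results(res, n=1):
    """Return (page, text) tuples for the n most relevant hits in res."""
    hits = res.get("hits", {}).get("hits", [])
    return [(hit["_source"]["page"], hit["_source"]["text"]) for hit in hits[:n]]


if __name__ == "__main__":
    # Hand-written stand-in for a real response.
    fake_res = {
        "hits": {
            "hits": [
                {"_score": 2.5, "_source": {"page": 12, "text": "first match"}},
                {"_score": 1.1, "_source": {"page": 40, "text": "second match"}},
            ]
        }
    }
    print(top_results(fake_res, n=2))
```

With n=1 this mirrors the [0] lookup; raising n returns more pages in relevance order.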

Using the Kibana Front End

Instead of viewing a poorly recorded GIF of my Jupyter notebook, we can actually use another elastic tool to better view our results.

Kibana is an open source front end for elastic that’s great for visualization.

First, install kibana from this link.

Once you have Kibana installed, start the service by running kibana in a terminal and then navigate to localhost:5601 in your favorite web browser.

This will let you interact with the application.

The only thing we have to do here before we interact with our index is create an index pattern.

Go to Management > Create Index Pattern, and then type in “mueller-report” — Kibana should let you know that the pattern matches the index we created earlier in elastic.

And that’s it! If you go to the Discover tab on the left, you can search your index in a much easier (and more aesthetic) manner than we did in elastic.

Next Steps

It would probably be cool to throw this up on AWS so anyone can use it (with a nicer front end), but I’m not really tryna tie my credit card to that instance at the moment.

If anyone else wants to, feel free! I’ll update soon with a docker container and github link.
