Using Image Data to Determine Text Structure

No, and there are plenty of improvements that can be made.

I toyed with the idea of using the circularity of the second character, or perhaps considering the steepness of the slope between the centers of the two characters being compared.

Both these ideas would require more hard coded numbers that are unlikely to be reliable with other fonts and styles.

I figured it may be easier in the future to recognize that a letter has a random dot.

While connecting the dots did not always produce a perfect solution, it greatly assisted with cleaning up the image for line detection.

Walk the Line

There were two methods devised for finding all the letters that belonged to a single line.

Both ideas were based on manipulating data that was scavenged from the image.

A line of text is characterized by a series of words that sit on the same height on a page.

Simply put, each letter that belongs to a specific line should have a similar y-axis value.

This method was inspired by principal component analysis, which is a means of reducing the number of dimensions of a data set.
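As a rough illustration of the projection idea, here is a minimal sketch. It assumes each detected letter is stored as a bounding box `(x, y, w, h)` with `y` measured from the top of the image; that layout and the name `project_onto_y` are hypothetical and may not match the original script. The x-coordinate is simply dropped, and the function counts how many letters cover each pixel row.

```python
def project_onto_y(boxes, image_height):
    """Count how many letter boxes cover each pixel row.

    boxes: iterable of (x, y, w, h) tuples (assumed layout, y from the top).
    """
    counts = [0] * image_height
    for _x, y, _w, h in boxes:
        # Ignore the x-axis entirely: only the vertical extent matters here.
        for row in range(max(y, 0), min(y + h, image_height)):
            counts[row] += 1
    return counts


# Two letters on one line and a single letter on a second line further down.
boxes = [(0, 10, 8, 12), (10, 11, 8, 11), (0, 40, 8, 12)]
profile = project_onto_y(boxes, image_height=60)
```

Rows covered by many letters show up as peaks in `profile`, while the empty rows between lines of text stay at zero.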

Number of overlapping letters (Right), the projection of the Constitution on the Y-axis (Left)

At the time, the idea of projecting the coordinates onto the y-axis seemed ingenious.

However, how to proceed from this step was unclear.

Initially I thought to frame this as a clustering problem and apply a mean-shift algorithm to find potential clusters.

My concern was that mean-shift required introducing additional hard coded parameters.

I didn’t want to build an OCR that could only read this rendition of The Constitution.

For the first method, I took inspiration from the famous image processing technique known as Otsu’s threshold.

Otsu’s threshold is traditionally used to find a global threshold value for an image.

It works best when there is a bi-modal distribution in a histogram.

I’ve used this method in numerous other projects.

At the heart of the threshold method is a mathematical technique that can be applied to any set organized into a histogram.

Finding a threshold is based on the technique of minimizing the intra-class variance of the histogram data.

A strong explanation of implementing this algorithm in Python is demonstrated in the OpenCV documentation.
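To show that the idea is not tied to pixel intensities, here is a small pure-Python sketch of an Otsu-style threshold applied to an arbitrary list of numbers. The function name `otsu_threshold` and the default bin count are my own assumptions, and this is not the OpenCV implementation. It tries every split of the histogram and keeps the one with the smallest combined within-class sum of squares, which is equivalent to minimizing the weighted intra-class variance.

```python
def otsu_threshold(values, bins=32):
    """Return a split value that minimizes the intra-class variance."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # every value is identical, so there is nothing to split
    width = (hi - lo) / bins

    # Build the histogram and remember the center of each bin.
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]

    def sum_of_squares(h_part, c_part):
        n = sum(h_part)
        if n == 0:
            return 0.0
        mean = sum(h * c for h, c in zip(h_part, c_part)) / n
        return sum(h * (c - mean) ** 2 for h, c in zip(h_part, c_part))

    best_split, best_score = 1, float("inf")
    for split in range(1, bins):
        score = (sum_of_squares(hist[:split], centers[:split])
                 + sum_of_squares(hist[split:], centers[split:]))
        if score < best_score:
            best_split, best_score = split, score
    return lo + best_split * width


# Mostly small values with a couple of large outliers (a bi-modal set).
sample = [1, 2, 1, 2, 1, 20, 1, 2, 21, 1]
threshold = otsu_threshold(sample)
```

When several splits tie, as they do when there is a wide empty gap between the two modes, this sketch keeps the first one, so the returned value sits just above the lower cluster.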

Instead of directly using the projected coordinates, I derived further data from these points.

Between each adjacent pair of projected points, I found the difference in position (the distance of separation).

I figured the distance on the y-axis between each point will distinguish if a new line has begun.

These distances were placed into a histogram.
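A minimal sketch of this step, assuming the projected points are sorted along the y-axis before the differences are taken (the original script may order them differently). The helper names and the bin width are illustrative only.

```python
def adjacent_gaps(y_values):
    """Distance between each adjacent pair of projected points."""
    ordered = sorted(y_values)  # assumption: compare neighbours along the axis
    return [b - a for a, b in zip(ordered, ordered[1:])]


def histogram(values, bin_width):
    """Count values per bin, keyed by the lower edge of the bin."""
    counts = {}
    for v in values:
        key = int(v // bin_width) * bin_width
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))


y_values = [10, 11, 10, 40, 41, 12, 42]
gaps = adjacent_gaps(y_values)
gap_histogram = histogram(gaps, bin_width=5)
```

Most gaps fall into the lowest bin (letters on the same line), and the rare large gaps mark the jump to a new line. That is the bi-modal shape a threshold method needs.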

Histogram of typed (Left) and handwritten (Right) text

Once the threshold value was determined, each letter was sequentially examined.

The distance between each letter and the letter that followed it was measured.

If the difference was greater than the threshold, the two letters were considered to be on different lines.

Otherwise, the letters were on the same line.
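The comparison described above could look roughly like this. It is a sketch under the assumption that letters are visited in order of their y-coordinate. The `threshold` argument stands in for whatever value the threshold step produced, and the value `5` in the example is made up.

```python
def group_into_lines(y_values, threshold):
    """Group letter indices into lines by walking them top to bottom."""
    order = sorted(range(len(y_values)), key=lambda i: y_values[i])
    lines = []
    previous = None
    for i in order:
        # A jump larger than the threshold means a new line has begun.
        if previous is None or y_values[i] - y_values[previous] > threshold:
            lines.append([])
        lines[-1].append(i)
        previous = i
    return lines


y_values = [10, 11, 10, 40, 41, 12, 42]
lines = group_into_lines(y_values, threshold=5)
```

Each inner list holds the indices of the letters assigned to one line, so the result can be mapped back to the original bounding boxes.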

The method worked fairly well for typed text.

While it did struggle to pick up lines occasionally, the Otsu method worked well when line spacing was consistent.

Handwritten text was far less successful, and even that is a generous critique.

The method did find lines.

From the test image, however, only half of the lines were found.

Additionally, the irregularity that arises from commas, parentheses, and subscripts generated false positives.

An alternative approach involved treating the distances between each pixel as a function.

Such a function creates a plot in which areas where new lines occur contain a large spike.

Letters that are on the same line will produce regions of low values, ideally zero.

This method still used a threshold to determine whether a letter was on the next line or not.

The average and standard deviation of the distances were taken.

The sum of the average and standard deviation acted as a threshold.

Each point along the function was looped through.

Regions between two threshold-passing spikes were considered to be a single line.

Finding the center point of these regions gave an index of a letter on that line.

The bottom dimension of the letter was used to determine the position of the line on the y-axis.
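The steps above can be sketched as follows. This is a sketch under stated assumptions, not the code from the repository: the input is taken to be the bottom y-coordinate of each letter, already sorted top to bottom, the population standard deviation is used, and the start and end of the data are treated as region boundaries. The name `find_line_positions` is hypothetical.

```python
from statistics import mean, pstdev


def find_line_positions(bottoms):
    """Estimate one y-position per text line from sorted letter bottoms."""
    if len(bottoms) < 2:
        return list(bottoms)

    # The "function": distance between each letter and the next one.
    gaps = [b - a for a, b in zip(bottoms, bottoms[1:])]
    threshold = mean(gaps) + pstdev(gaps)

    # gaps[i] sits between letter i and letter i + 1, so a spike at i
    # means letter i + 1 starts a new region.
    starts = [0] + [i + 1 for i, g in enumerate(gaps) if g > threshold]
    ends = starts[1:] + [len(bottoms)]

    positions = []
    for start, end in zip(starts, ends):
        middle = (start + end - 1) // 2  # center letter of the region
        positions.append(bottoms[middle])
    return positions


bottoms = [22, 22, 22, 23, 52, 52, 52, 53, 82, 83]
line_positions = find_line_positions(bottoms)
```

Because the threshold is derived from the data itself, no font-specific constant is needed, which was the main reason for avoiding hard-coded parameters.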

The second approach produced impressively good results.

In the basic example, the lines of typed text were identified perfectly.

A more complex example (shown at the end) did experience problems with superscripts, such as quotation marks.

For handwritten script the method was able to find lines, although the results tended to be messier.

In the example below, each line of letters is surrounded by a red bar above and below.

The red line above is actually detecting the hanging superscripts.

In the typed text this was mitigated by combining our dots.

The handwritten text is taken from a linear algebra lecture and is full of exponents that get detected as separate lines.

Overall, I’m fairly happy with the results of my custom method.

In this case, it is better to detect extraneous lines than to detect no lines.

Further data analysis can be performed to recognize which lines belong to superscripts.

Once this is determined, the line can be combined with its lower neighbor.

Now that we know which letters belong to which line, we can determine which letters belong to which word.

To find the lines, the y-axis coordinates were projected.

To find words, for each line the x-axis coordinates should be projected.

A similar distance algorithm should be able to detect which letters are part of the same word.
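Word detection is not implemented in this article, so the following is only a sketch of how the same distance idea might carry over to the x-axis. It assumes each letter on a line is given as `(x, width)`, measures the empty space between neighbouring letters, and reuses the mean-plus-standard-deviation threshold. All names here are hypothetical.

```python
from statistics import mean, pstdev


def split_into_words(letters):
    """Split one line of (x, width) letters into words at large gaps."""
    ordered = sorted(letters)
    if len(ordered) < 2:
        return [ordered] if ordered else []

    # Empty space between the right edge of a letter and the next letter.
    gaps = [nxt[0] - (cur[0] + cur[1])
            for cur, nxt in zip(ordered, ordered[1:])]
    threshold = mean(gaps) + pstdev(gaps)

    words = [[ordered[0]]]
    for gap, letter in zip(gaps, ordered[1:]):
        if gap > threshold:
            words.append([])  # a wide gap starts a new word
        words[-1].append(letter)
    return words


line = [(0, 8), (10, 8), (20, 8), (40, 8), (50, 8), (72, 8), (82, 8), (92, 8)]
words = split_into_words(line)
```

If every gap on the line is the same size, the standard deviation is zero and nothing exceeds the threshold, so the whole line is returned as one word.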

Afterthought

While I’ve enjoyed building the absolute beginnings of an OCR from scratch, for the remainder of this project, I plan to switch over to using Tesseract.

Tesseract is an open-source OCR engine supported by Google.

I can probably spend several more articles writing my own convoluted neural network to interpret text.

However, I do want this project to ultimately be robust and I’m sure the good folks at Google will assist me with that.

Github

Quick disclaimer.

The script in this GitHub repository is currently incredibly unorganized and does not represent final code.

I plan to dedicate some time to cleaning up the code at a later date.

TimChinenov/PictureText: A basic image processing code to detect text on a high contrast image (github.com)

Notice the error caused by the quotations between 400 and 500.
