Eliminating manual data entry: Using OCR to convert images to text (Tesseract.js + React)

Let’s walk through the code above first. Our generateText method takes the uploads currently stored in our uploads React state, loops through each upload, and runs it through Tesseract.

Tesseract processes each image, and from each run we keep a confidence score, the text result, and a pattern result.

Confidence is a score from 0 to 100 that estimates how accurate the conversion was.

We also defined a regex pattern so we can pick out words matching a specific pattern.

This would help us find needles in a haystack if needed.

The pattern here matches every word that is exactly 10 characters long.
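As a rough illustration of the 10-character rule, here is a minimal sketch. The regex and the findTenCharWords helper are assumptions made for this example, not necessarily the exact pattern used in the app.

```javascript
// Hypothetical pattern: whole words made of exactly 10 word characters
// (ASCII letters, digits, or underscore).
const TEN_CHAR_WORD = /\b\w{10}\b/g;

// Return every 10-character word found in `text`, or an empty array.
function findTenCharWords(text) {
  // String.prototype.match returns null when a global regex finds nothing,
  // so fall back to an empty array to keep callers simple.
  return text.match(TEN_CHAR_WORD) || [];
}

console.log(findTenCharWords("Invoice ABCDE12345 issued to customer 0987654321"));
```

Note that \w and \b only understand ASCII word characters in JavaScript, so a pattern like this would not pick up Thai words.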

At the end of each loop, the result is stored in our documents state.
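The flow described so far (loop over the uploads, recognize each one, keep confidence, text, and pattern matches) could look roughly like the sketch below. The generateDocuments name, the injected recognize function, and the { text, confidence } result shape are assumptions for illustration. In the app, recognize would presumably wrap the Tesseract.js recognition call, and the returned array would be written to the documents state.

```javascript
// Sketch of the loop only. `recognize` is passed in so the example runs
// without a browser; it is assumed to resolve to { text, confidence }.
async function generateDocuments(uploads, recognize, pattern = /\b\w{10}\b/g) {
  const documents = [];
  for (const upload of uploads) {
    const result = await recognize(upload);
    documents.push({
      confidence: result.confidence,              // 0-100 score
      text: result.text,                          // full recognized text
      pattern: result.text.match(pattern) || [],  // words matching the regex
    });
  }
  return documents;
}

// Stand-in recognizer for demonstration only (this is not Tesseract).
const fakeRecognize = async (upload) => ({
  text: `Ref ABCDE12345 from ${upload}`,
  confidence: 91,
});

generateDocuments(["scan-1.png"], fakeRecognize).then((docs) => console.log(docs));
```

This version processes images one at a time, which keeps the results in upload order. Starting all recognitions at once is also possible, but results may then arrive in a different order.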

In our JSX, add an onClick event to our generate button to call the generateText method.

Replace the JSX results section with the code above. It calls map to loop through our documents and renders each result on the frontend.
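The mapping step can be sketched in plain JavaScript, without JSX, to show what each rendered row needs. The toResultRows helper and its field names are hypothetical. In the component, the same map callback would return a JSX element instead of an object.

```javascript
// Hypothetical helper: turn each stored document into the values that a
// results row would display.
function toResultRows(documents) {
  return documents.map((doc, index) => ({
    key: index,                                   // React expects a key per list item
    confidence: `${doc.confidence.toFixed(1)}%`,  // e.g. 90 becomes "90.0%"
    matches: doc.pattern.length > 0 ? doc.pattern.join(", ") : "(none)",
    text: doc.text,
  }));
}

console.log(
  toResultRows([{ confidence: 90, text: "Ref ABCDE12345", pattern: ["ABCDE12345"] }])
);
```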

Step 5: Test!

Upload some image documents with some text, hit that generate button, and watch the results roll in!

NOTE: If tesseract errors are being returned after clicking the generate button, try adding tesseract via CDN inside the document head at public/index.html:

<script src="https://cdn.rawgit.com/naptha/tesseract.js/1.0.10/dist/tesseract.js"></script>

Conclusion

This is a bare bones example of how tesseract works.

This type of app can be expanded upon by taking full advantage of tesseract’s API (e.g. loading bars, character whitelisting, different languages, etc).
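For the loading bar idea, a small helper could turn progress updates into a label. This is only a sketch: the progressLabel name and the { status, progress } message shape (with progress as a fraction between 0 and 1) are assumptions, so check the API docs for what your version of the library actually reports.

```javascript
// Format a progress update for a loading bar. The message shape is assumed
// to be { status: string, progress: number between 0 and 1 }.
function progressLabel(message) {
  const fraction = typeof message.progress === "number" ? message.progress : 0;
  // Clamp so an out-of-range value can never produce a label beyond 0-100%.
  const clamped = Math.min(1, Math.max(0, fraction));
  return `${message.status}: ${Math.round(clamped * 100)}%`;
}

console.log(progressLabel({ status: "recognizing text", progress: 0.5 }));
```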

Check out the API docs here: https://github.com/naptha/tesseract.js#tesseractjs

So, did it work out for us?

Unfortunately, the output was not accurate enough for us to use in this scenario.

This was due to various issues:

- Some documents given to us were not in good condition.
- Fonts used in some documents would confuse tesseract.js.
- It couldn’t handle both English and Thai at the same time.
- At the time of writing this, tesseract.js cannot be trained to improve accuracy.

While it didn’t solve the particular issue we were having, it was still a super fun micro project to work on.

Tesseract.js has its limitations, but it is just a port of the more sophisticated Tesseract OCR Engine, and we like to think it will only get better from here!

P.S. We’re hiring! Panya Studios is always on the lookout for talented and passionate individuals to join our growing team in Bangkok, Thailand. Explore our current openings at https://studios.panya.me/.
