Digitizing handwritten vaccination records in Nigeria with AI and ML

Piotr Krosniak
May 17

Introduction

Working at UNICEF Nigeria as a Polio Data Scientist, I ran into the problem of errors in the vaccination cards delivered by 20,000 polio volunteers and, an even bigger issue, the sheer number of vaccination cards to check.
Simple digitizing and giving everyone a tablet was not an option.
After some research, I decided to use AI/ML and computer vision to “read” information from the cards, then provide a feedback mechanism about the most common errors and predict the correct information.
In this tutorial, you will see how to achieve this, what the results are, and recommendations for future optimization.
I will mainly be using the Python libraries TensorFlow and OpenCV, along with some supporting libraries.

Installation

Installing TensorFlow varies with the OS and hardware you are going to use. Refer to this article for general instructions.

For this tutorial, I will be using the following:

OS: Linux x64 (Arch Linux)
Python package manager: Anaconda or Miniconda (installation instructions here)
CUDA 10.0
Python TensorFlow API v1
opencv-python

Using Miniconda (or Anaconda), follow these steps to install the required Python libraries.

Create the conda environment:

    conda create -n pyocr
    conda activate pyocr

Install the required packages:

    conda install tensorflow
    conda install opencv
    conda install -c lightsource2-tag pyzbar
    pip install editdistance

Preserve the library versions for future replication:

    conda env export > <environment-name>.yml

Recreate the environment on another machine

To recreate the environment on another machine, run the following there. It creates the environment from the exported file, which you can then activate:

    conda env create -f <environment-name>.yml

Recognizing text using TensorFlow

The first thing to understand is that the accuracy of this model depends on the samples you use for training.
More samples are needed for better accuracy.
This also means that if you need to recognize text written by multiple people, you have to include sufficient text samples from each of them.

The entire tutorial code is uploaded to the GitHub repository. Clone the repository with git clone if you need the final code:

    git clone git@github.…git pyocr

Inputs

Check out the Inputs folder in the cloned repository. Keep the images you want to run the script on here (for better organization).

Fig 1: GitHub folder structure for the input folder

Get Training Data

Get the IAM dataset

Register at: http://www.
txt into the data/ directory.
Create the directory data/words/.
Put the content (directories a01, a02, …) of words.tgz into data/words/ (i.e., on a Linux terminal, in the data folder, run tar xvf words.tgz -C words).
Run checkDirs.py for a rough check on the files.

Check that the directory structure looks like this:

    data
        test.…txt
        words
            a01
                a01-000u
                    a01-000u-00-00.png
                    …
                …
            a02
            …

Training the model

Extract the model first.
Unzip the model.zip file into the same folder (<root>/model). Then run the training from the src directory. The script builds on the previously trained model and improves its accuracy based on your data:

    python main.py --train

Training may take a long while: around 16–18 hours without a GPU.
The script runs training passes, called epochs, until text recognition accuracy no longer increases appreciably between consecutive epochs.
After completion, you will see files generated under the model folder.
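
The stopping rule described above (keep training until accuracy stops improving between epochs) can be sketched in plain Python. This is only an illustration and not the code from main.py: run_epoch, patience, min_delta and max_epochs are hypothetical names, and the actual script may use a different criterion or different thresholds.

```python
def train_with_early_stopping(run_epoch, patience=3, min_delta=0.0, max_epochs=1000):
    """Run epochs until validation accuracy stops improving.

    run_epoch is a hypothetical callable: it trains for one epoch and
    returns the validation accuracy as a float.
    patience is the number of consecutive epochs without improvement
    that are tolerated before training stops (an assumed default).
    """
    best_accuracy = float("-inf")
    epochs_without_gain = 0
    history = []
    for epoch in range(1, max_epochs + 1):
        accuracy = run_epoch(epoch)
        history.append(accuracy)
        if accuracy > best_accuracy + min_delta:
            # Appreciable improvement: remember it and reset the counter.
            # A real training script would typically save a checkpoint here.
            best_accuracy = accuracy
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
        if epochs_without_gain >= patience:
            break
    return best_accuracy, history
```

A larger patience makes training more tolerant of noisy epochs at the cost of extra training time, and a positive min_delta is one way to express what counts as an "appreciable" increase.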

Fig 2: The model folder with the trained TensorFlow models

The snapshots and checkpoints will be generated as shown above.

Running the OCR script

Now that the model is generated in the code folder, let us run the code to extract text from our images. Make sure your input files are in the Input folder.

Fig 3: Input folder with your images

Run the code from the src folder (inside a terminal):

    python Demo.py

The code will run on the input images. You will see output in the terminal like the following:

Fig 4: Sample terminal output when running inference

Once the code has finished running, the outputs will be in the Output folder:

Fig 5: Output folder after running the OCR script

The folders contain the table cells, with each cell saved as a separate image. We will use these generated images to further improve accuracy in the next section.

Meanwhile, the text recognized by your current model is saved in CSV files with the same names as the input images. These CSV files can be opened in spreadsheet software like Microsoft Excel or Google Sheets.

Improving the Accuracy

The individual table cells from your images are saved as separate images in the Output folder.
These images can help the model learn the handwriting-to-text mapping for your own data set.
Typically, this is necessary if you have a lot of uncommon English words, such as names, or if the handwriting style in your images differs substantially from the default IAM dataset the model was trained on.

To use these table cell images for training, follow the steps below:

1. Preprocess the images to make them IAM dataset compliant. This is absolutely necessary for the model to train properly on your images. At a high level, the following steps are performed:
   a. Thickening faint lines in the text
   b. Removing extra space around the word with word segmentation (refer to this code)
   c. Improving contrast through a thresholding technique

2. Rename and copy the images into the data folder in the format used by the DataLoader.py module. For example, a file c01-009-00-00.png should be saved in the following folder hierarchy:

    Words
        a01
            c01-009
                c01-009-00-00.png

   However, you can change the folder hierarchy and file naming conventions by editing DataLoader.py.

3. Edit the words.txt file in the data folder to include these images.

The following code performs the contrast enhancement and line thickening from step 1:

    import numpy as np
    import cv2

    # read as grayscale ('in.png' and 'out.png' are placeholder file names)
    img = cv2.imread('in.png', cv2.IMREAD_GRAYSCALE)

    # increase contrast: stretch pixel values to the full 0-255 range
    pxmin = np.min(img)
    pxmax = np.max(img)
    imgContrast = (img - pxmin) / max(pxmax - pxmin, 1) * 255
    imgContrast = imgContrast.astype(np.uint8)

    # increase line width: eroding a dark-on-light image thickens the strokes
    kernel = np.ones((3, 3), np.uint8)
    imgMorph = cv2.erode(imgContrast, kernel, iterations=1)

    # write
    cv2.imwrite('out.png', imgMorph)

To write the words.txt file, follow the format below, as applicable to your images.

Sample line: a01-000u-00-00 ok 154 1 408 768 27 51 AT A

a01-000u-00-00 -> word id for line 00 in form a01-000u. This is also the file name of the image you are mapping.
ok -> result of word segmentation
    ok: word was correctly segmented
    er: segmentation of word can be bad
154 -> gray level used to binarize the line containing this word. This is the contrast stretching/thresholding step.
1 -> number of components for this word
408 768 27 51 -> bounding box around this word in x, y, w, h format
AT -> the grammatical tag for this word; see the file tagset.txt for an explanation
A -> the transcription for this word, describing the text contents of the image

The above will tailor the model to your images.
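
The words.txt line format described above can be read and written with a few lines of Python, which may help when adding entries for your own images. This is a minimal sketch that assumes fields are separated by whitespace in the order shown in the sample line; WordEntry, parse_words_line and format_words_line are hypothetical helper names and are not part of the repository.

```python
from typing import NamedTuple, Tuple


class WordEntry(NamedTuple):
    """One line of words.txt (hypothetical helper type)."""
    word_id: str        # e.g. a01-000u-00-00, also the image file name
    segmentation: str   # result of word segmentation, e.g. "ok"
    gray_level: int     # gray level used to binarize the line
    components: int     # number of components for this word
    bounding_box: Tuple[int, int, int, int]  # x, y, w, h
    tag: str            # grammatical tag
    transcription: str  # text contents of the image


def parse_words_line(line: str) -> WordEntry:
    """Split one words.txt line into its fields."""
    fields = line.split()
    if len(fields) < 10:
        raise ValueError(f"expected at least 10 fields, got {len(fields)}: {line!r}")
    x, y, w, h = (int(value) for value in fields[4:8])
    return WordEntry(
        word_id=fields[0],
        segmentation=fields[1],
        gray_level=int(fields[2]),
        components=int(fields[3]),
        bounding_box=(x, y, w, h),
        tag=fields[8],
        # Assumption: a transcription containing spaces is rejoined with single spaces.
        transcription=" ".join(fields[9:]),
    )


def format_words_line(entry: WordEntry) -> str:
    """Turn an entry back into a words.txt line in the same field order."""
    x, y, w, h = entry.bounding_box
    return (f"{entry.word_id} {entry.segmentation} {entry.gray_level} "
            f"{entry.components} {x} {y} {w} {h} {entry.tag} {entry.transcription}")
```

Parsing the sample line and formatting the result again reproduces the original line, which is a quick way to check that hand-written entries follow the expected field order.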
To improve the accuracy of the model itself, refer to the improving accuracy section of this page.

Explanation of the approach

The code performs three major steps:

Match the template and rotate the image
Recognize rows in the table and crop
Recognize text using python-tensorflow

The recognition algorithm is based on a simplified version of an HTR (handwritten text recognition) system. If you are interested in the mechanism, you can refer to this paper.

It consists of 5 CNN layers, 2 RNN (LSTM) layers, and the CTC loss and decoding layer.

The input image is a gray-value image of size 128×32.
5 CNN layers map the input image to a feature sequence of size 32×256.
2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32×80. Each matrix element represents a score for one of the 80 characters at one of the 32 time steps.
The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or decodes the matrix to the final text with best path decoding or beam search decoding (when inferring).
Batch size is set to 50.

Fig 5: Mechanisms involved in the OCR step using TensorFlow

Conclusion

Following this tutorial, you now have a way to automate the digitization of handwritten text in tabular format.
Countless hours can be saved once you train the model to recognize your handwriting and customize according to your needs.
However, be careful as the recognition is not 100% accurate.
So, a round of high-level proofreading after the spreadsheet is generated might be needed before you are ready to share the final spreadsheet.

Reference:

Code reference: https://github.com/PiotrKrosniak/ocrbot
Handwriting recognition using Google TensorFlow: https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5
Handling edge cases: https://towardsdatascience.com/faq-build-a-handwritten-text-recognition-system-using-tensorflow-27648fb18519
Dataset to start with: http://www.