My Journey with Optical Character Recognition

Rhys Thomas
9 min read · Oct 30, 2020

After coronavirus stranded me and my girlfriend in New Zealand in February, we had to make some difficult decisions. Should we come home or apply for a working visa here in NZ? We had only been away from the UK for 2 months, so we decided we weren't ready to go home and applied for the 1-year working holiday visa, which we were eventually granted in May of this year. After weeks of job searching and rejection, I was offered a job doing data entry for the Electoral Commission of New Zealand. The role involved processing thousands of handwritten forms and entering their data into a database. I have always looked for an easier way of doing a job, and so my journey with optical character recognition (the fancy name for getting a computer to read text, in my case handwriting) began.

Optical character recognition (OCR) tools typically take in an image (JPEG, PNG, or even PDF) and use pattern recognition, often powered by machine learning, to magically "read" each word and newline. For each word the tool produces a confidence score: if the score meets a set threshold, the word is deemed accurate and is returned to the user; if it doesn't, no word is returned.
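As a rough sketch of that idea in Python (this is not any particular engine's API, and the threshold value is purely illustrative):

```python
# Keep only the words the engine is reasonably sure about; anything below
# the threshold is dropped rather than returned as a guess.
CONFIDENCE_THRESHOLD = 0.8  # illustrative value, not from any real tool

def confident_words(candidates):
    """candidates: iterable of (word, confidence) pairs from the OCR engine."""
    return [word for word, confidence in candidates
            if confidence >= CONFIDENCE_THRESHOLD]
```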

Research

After researching and experimenting with several tools, I found Google's Vision API the most accurate, correctly returning about 80–90% of the words in my example text. This is most likely down to the processing power Google has behind the tool, drawing on some of the best machine learning in the world for its handwriting recognition. By contrast, an alternative tool I tried, SimpleOCR, returned about 10% of the words from my example. Another benefit of Google Vision was its publicly accessible API, meaning I could interact with the tool through a simple request and response.
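For a flavour of what a minimal Vision request looks like, here is a sketch using the official google-cloud-vision Python client. It assumes you have pointed the GOOGLE_APPLICATION_CREDENTIALS environment variable at a valid service-account key, and the filename is invented:

```python
from google.cloud import vision

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key
client = vision.ImageAnnotatorClient()

with open("enrolment_form.jpg", "rb") as f:  # hypothetical filename
    image = vision.Image(content=f.read())

# document_text_detection is the mode suited to dense or handwritten text
response = client.document_text_detection(image=image)

# The whole document comes back as one long string
print(response.full_text_annotation.text)
```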

As with every project, the optimistic and naive side of me thought this would take a week, maybe two at most. It's as simple as sending an image to the API, maybe un-mangling the response, and injecting the result into the fields. How hard could it be? As work began, I realised how much extra work would be required for this to become a viable and useful tool.

Python attempt no.1

As I didn't have access to the Electoral Commission's test environment, I needed somewhere to test my scripts. Luckily the system currently in place (called MIKE) is pretty simple and user-friendly, so with some basic HTML and CSS I was able to whip up a mock version of MIKE pretty quickly. I went to the library to print off an enrolment form, which I filled in with incredibly neat handwriting to give my script a best-case scenario to test against. I favoured Python as my language of choice because of its elegance and simplicity when it comes to making API requests. Once the Vision API was set up, it returned a large "blob" of text: every word in the document in one long string. After manipulating the response to extract the words I wanted, I needed a way to inject them into my fake MIKE. In a previous project I had used Webdriver.IO to automate web interaction, but that framework is restricted to JavaScript and I didn't want to rewrite my whole script (which I ended up doing later anyway), so I used Selenium WebDriver to interact with the fields instead. Once it was running, my script would start a Chrome browser, navigate to my fake MIKE, send the image on the webpage to Vision, extract the words I needed, and inject them into my fields. Job done, right?
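The injection half was the straightforward bit. Here is a minimal sketch of the Selenium side; the page path and field IDs are invented for illustration (the real MIKE fields will differ):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("file:///path/to/fake_mike.html")  # hypothetical path to the mock page

# Hypothetical mapping of form field IDs to words pulled out of the OCR blob
extracted = {"first-name": "Rhys", "surname": "Thomas", "suburb": "Te Aro"}

for field_id, word in extracted.items():
    driver.find_element(By.ID, field_id).send_keys(word)
```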

Well, at this stage I thought I should probably commit something to GitHub, as I didn't want to lose this work if my laptop decided to commit seppuku. I committed and pushed my work to this repo and went to sleep (my code is probably ugly and inefficient, but it works). I awoke to a bombardment of emails from Google saying my API key had been compromised and was now being used for crypto-mining. Nice one, Rhys. I had committed my API key to a public GitHub repo, from which it had promptly been stolen, and now my Google API account had been revoked. After 2 weeks of appeals to get my account reinstated, I was eventually allowed back in (after removing the key from my repo, obviously).
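The real fix, beyond a .gitignore, is to keep credentials out of source entirely, for example by reading them from the environment. A tiny sketch (the variable name is hypothetical):

```python
import os

# The key lives in the environment (or a file listed in .gitignore),
# so it never appears in anything that gets pushed to a public repo.
api_key = os.environ["GOOGLE_API_KEY"]  # hypothetical variable name
```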

This gave me a chance to think about the practical use of the tool. Running a command from the terminal to start a script is not "user-friendly", and the data entry operators who would be using it are not exactly the most tech-savvy people you've ever met. I made a GUI for my script using PySimpleGUI, which did the job, but it still had to be run from the command line and started its own Chrome browser.
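For the curious, a GUI like that boils down to very little PySimpleGUI code. A sketch, with the layout illustrative and run_ocr standing in for the main script:

```python
import PySimpleGUI as sg

def run_ocr(path):
    print(f"would OCR {path}")  # stand-in for the real script

# Minimal front end: pick a form image, press a button, run the script
layout = [
    [sg.Text("Form image"), sg.Input(key="-FILE-"), sg.FileBrowse()],
    [sg.Button("Run OCR"), sg.Exit()],
]
window = sg.Window("OCR helper", layout)

while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, "Exit"):
        break
    if event == "Run OCR":
        run_ocr(values["-FILE-"])

window.close()
```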

After trying unsuccessfully to bundle the project into a single executable, I thought further about how the tool should be implemented. If there were a button built into the webpage that ran the script, it would negate the need for any executables or terminal commands. It turns out Python is not the easiest thing to embed in a webpage; I wanted everything done client-side, and that didn't seem possible with Python. Enter JavaScript.

JS

I used a Webdriver.IO framework to get the script up and running so it would start my browser, run the OCR, and inject the response into my fields (after, admittedly, a long time spent setting it up). My 200-line Python script now ran in 65 lines of JavaScript, as I had come up with a much cleaner way to do things, so I've got that going for me, which is nice. The script could now be added directly to my HTML! Or so I thought. After I added an on-screen button to run the script and nothing happened when I pressed it, I realised that the browser had no context for the dependencies I was using. It seems that every time I feel like I'm getting close, something gets in the way. I messaged a friend who is a frontend dev about my problem, and he pointed me to Webpack, which bundles all of your dependencies to run with your script. Great! I also told him about my duplicate project in Python, and he reminded me that I could set up my own service in an AWS Lambda and use Amazon's API Gateway to trigger the script, with the response handled by JavaScript and injected into my fields. This sounded like a much more interesting way of doing things, as I had never set up API Gateway before, though I had used Lambdas once, when I created a Pinterest auto-pinner that ran nightly.

Back to Python

Before chucking everything into a Lambda, I wanted to tidy the script up to remove junk, and managed to reduce my 200-line script down to about 85 lines. After racking my brains to remember how to create the deployment package for the Lambda (why is there a difference between compressing a folder and compressing all of the contents of the folder?!), I altered my script to ingest a JSON request body (just the URL of the image I want to convert to text) and produce a JSON response. Upon testing the Lambda, I kept getting an error saying that one of Google's modules couldn't be found (even though it's right there in the package). Investigation revealed that for certain projects, particularly ones using Google Cloud Platform, you can't just compile the deployment package on your own machine: you have to build it on the same OS and setup that AWS Lambda uses to run your code. Great. Enter the third, and hopefully final, Amazon Web Service: EC2.
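Reshaped for Lambda, the script boils down to a handler along these lines. This is a sketch, not my actual code: the request field name and the stand-in result are invented, and it assumes API Gateway's proxy integration, which passes the request body as a JSON string:

```python
import json

def lambda_handler(event, context):
    # API Gateway's proxy integration delivers the request body as a JSON string
    body = json.loads(event["body"])
    image_url = body["imageUrl"]  # hypothetical field name

    # Stand-in for the Vision call and word extraction from the original script
    words = {"first-name": "Rhys", "surname": "Thomas"}

    return {
        "statusCode": 200,
        "body": json.dumps({"imageUrl": image_url, "words": words}),
    }
```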

EC2

This StackOverflow post described my situation perfectly, and there is a handy guide to spinning up an EC2 instance to create your deployment package. The trouble was that the post was written for Python 2.7, whilst my project and Lambda were configured for Python 3.8. After I installed 3.8 on the EC2 instance using this guide (the normal sudo yum commands only go up to 3.6), and made sure pip was up to date, I was able to create a working deployment package! Yaay! But not yaay. I had accidentally posted my API key to GitHub again and my account was blocked (I still don't know how I managed it this time, as the key was in the .gitignore, but oh well). Another 3 weeks went by as I appealed to have my account reinstated, to no avail, so I used another Google account I had and generated a new API key.

CORS

I felt so close at this point: my deployment package was working, and testing my endpoint through Postman and API Gateway returned the response I wanted. My final hurdle was CORS. For those of you who don't know, CORS stands for Cross-Origin Resource Sharing, a browser security mechanism that controls whether a webpage may request resources from a domain other than the one that served it. In short, the browser asks the server (via a preflight request) whether the cross-origin request is allowed, and only proceeds if the server responds with the appropriate headers.

My problem lay on the server side of that exchange, as I couldn't get the server to respond with the headers needed to allow the request from another origin. After probably 6–7 hours of staring at the same infuriating error, I managed to configure my API to respond with the correct headers and return my values. The returned values are then injected directly into the fields, and the work is complete.
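For anyone hitting the same wall: with a Lambda proxy integration, the function itself has to return the CORS headers (and OPTIONS/CORS must be enabled on the API Gateway resource). A sketch, with a wildcard origin for illustration; in practice you would pin it to the page's origin:

```python
import json

def lambda_handler(event, context):
    words = {"first-name": "Rhys"}  # stand-in for the real OCR result

    return {
        "statusCode": 200,
        # Without these headers the browser refuses to hand the response to the page
        "headers": {
            "Access-Control-Allow-Origin": "*",  # illustrative; restrict in practice
            "Access-Control-Allow-Headers": "Content-Type",
            "Access-Control-Allow-Methods": "OPTIONS,POST",
        },
        "body": json.dumps(words),
    }
```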

Benefits

At 3.5 seconds from execution to completion, my tool significantly speeds up the process of keying the forms. Even allowing several more seconds to correct any errors made by the OCR, this comes in well under the 35 seconds it took me to type the data manually, roughly a tenfold speed-up. Additionally, it is easy to set up as a backend service, or everything could be done client-side.

Drawbacks

My tool is not 100% accurate, and it will not be able to read much of the illegible handwriting that comes our way. It's smart, but it doesn't have the thousands of years of pattern recognition our brains have been trained on. As machine learning improves, this accuracy will increase, and OCR will be able to recognise increasingly untidy writing.

This tool uses a third-party API to process the forms. These forms contain sensitive electoral-roll data, and the Commission would have no say in how that information was retained or how quickly it was purged from Google's logs. Google is also offshore from New Zealand, so the government would not be able to regulate the handling of this personally identifiable information (PII).

Conclusions

The ideal situation would be for the Electoral Commission to develop its own OCR tool that all the data would flow through; however, this would require significant resources and would likely produce inaccurate results (without the kind of pre-trained machine learning models Google can draw on, I doubt a highly accurate tool could be built). An alternative would be for the Commission to enter into an agreement with Google covering the regulation and retention of private data, enabling use of the Vision API.

Even if this project goes on a shelf to gather dust and the Commission's dev team does nothing with it, I thoroughly enjoyed it. I built upon my Python knowledge, gained experience with several AWS tools I had never used before, and got exposure to Google's powerful APIs. I also learned the importance of the .gitignore file in stopping me from pushing private credentials to public spaces (after 7 total weeks of being locked out of my Google Cloud Platform account, I've definitely learned my lesson). If you fancy checking out my project, feel free to visit my repo here.
