As you may know, we have been developing and improving our intelligent document processing platform, Klassif.ai, for a while now. Our objective is simple: we want to unburden you of repetitive, manual tasks like processing documents so you can spend your valuable time elsewhere. However, building a platform that can help you with this isn't as easy as you might think, especially since we wanted it to be able to process any document.
Yes, you heard that right: any document. We wanted to process it all, including handwritten documents, particularly since one of our customers, HVW/CAPAC, could really benefit from such a solution. This governmental organization processes many unemployment benefit cards, which contain several handwritten fields such as a social security number, a date, an address, etc. Processing these documents takes enormous amounts of time. Luckily for them, this is where Brainjar comes in to make the process more efficient!
In this blog, we will show you how we developed a solution.
We developed Klassif.ai to extract relevant information from digital documents by accessing their digital text layer. These kinds of documents are the easiest to process.
In addition to processing purely digital documents, we had to extract information from scanned documents. For this, we used Tesseract, an optical character recognition (OCR) engine that can recognize more than 100 languages. By handling these two kinds of documents, we could already help our clients a great deal. However, we wanted to make our tool much more powerful. So, we had to tackle a third category that no one likes to talk about: documents containing handwritten text.
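Before we dive into that third category, here is what handling the first two could look like in practice. This is a minimal sketch, not our production pipeline: the library choices (pdfplumber for the text layer, pdf2image and pytesseract for Tesseract OCR) and the file name are assumptions made purely for illustration.

```python
# Minimal sketch: read a PDF via its digital text layer and fall back
# to Tesseract OCR when no text layer is present. Library choices are
# illustrative assumptions, not necessarily what Klassif.ai runs.
import pdfplumber
import pytesseract
from pdf2image import convert_from_path


def extract_text(pdf_path: str, lang: str = "eng") -> str:
    # 1. The easy case: the PDF already contains a digital text layer.
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    if text.strip():
        return text

    # 2. Scanned document: rasterize each page and run OCR on the images.
    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(img, lang=lang) for img in images)


if __name__ == "__main__":
    print(extract_text("sample_document.pdf"))
```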
As we mentioned, this is not a quick-and-dirty implementation. Because standard OCR engines struggle with handwritten text, we had to think outside the box and build a custom AI model. And training such a model requires loads of training data.
Based on our initial estimates, we needed approximately 400.000 handwritten documents. My hand starts to feel like it's developing carpal tunnel syndrome just from typing that number. ☹️ This kind of data is not readily available, and I wouldn't want to be the one writing all those documents. Yeah, we had to find a better solution for this.
Down the rabbit hole we went. We started thinking: what if we made a tool that can generate images containing handwritten text snippets? First, we would have much more control over our training data this way. In addition, our concerns about data volume would be a thing of the past. With such a tool, we could ask the system to generate a certain number of training samples, and the AI would generate the labeled training data for us.
Labeled? Yes! This approach enables us to generate the necessary labels for Klassif.ai automatically. For instance, when you tell the system to write a handwritten date in DD-MM-YYYY format, the label is implied. This means that instead of processing and labeling 10.000 documents manually for days on end, we could do this in a matter of minutes. As a result, we have our handwritten text, we know exactly what kind of text is written, and we know what the corresponding label is. Same result, less effort.
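To make the "labels come for free" idea concrete, here is a small sketch of how such labeled text snippets could be produced before they are rendered as handwriting. The field names and formats below are illustrative assumptions, not the actual Klassif.ai label set.

```python
# Minimal sketch: because we choose the text ourselves, every generated
# snippet comes with its label attached. The handwriting generator
# (described further down) only has to render each snippet as an image.
import random
from datetime import date, timedelta


def random_date_ddmmyyyy() -> str:
    start = date(1960, 1, 1)
    return (start + timedelta(days=random.randint(0, 365 * 60))).strftime("%d-%m-%Y")


def random_social_security_number() -> str:
    # Hypothetical 11-digit format, purely for illustration.
    return "".join(str(random.randint(0, 9)) for _ in range(11))


def generate_labeled_snippets(n: int) -> list[dict]:
    fields = {
        "date": random_date_ddmmyyyy,
        "social_security_number": random_social_security_number,
    }
    samples = []
    for _ in range(n):
        label, make_text = random.choice(list(fields.items()))
        samples.append({"text": make_text(), "label": label})
    return samples


if __name__ == "__main__":
    for sample in generate_labeled_snippets(5):
        print(sample)
```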
Easier said than done, of course, because if you can't design an AI system to recognize handwritten text, how on earth are you going to make an AI that generates it?
We've got our idea. Now we have to turn it into a result: a labeled dataset with handwritten text that we can use as training data. Time to get to the AI side of our solution.
First, we have to decide on a model structure and train the AI models to write sentences themselves.
In our model structure, the generator is the most crucial part, as it is the part we will use in the end. To give the generator its input, we first feed the target image to a text recognizer (to deal with spelling) and a style extractor (which captures stylistic features such as whether the text is written in cursive or in capitals); these deliver spaced text and style vectors, respectively. Once the generator has done its thing, the generated image is in turn fed to a text recognizer that checks the spelling, a discriminator that computes an adversarial loss, and an encoder that computes a perceptual loss.
On the one hand, the discriminator is meant to distinguish between fake and real samples; its adversarial loss is an excellent judge of the generator's performance.
On the other hand, the perceptual loss ensures that the generated images do not stray too far from the original image, pixel for pixel, constraining the network's results visually.
Taken together, the generator's task is to produce images that look like the original input, while the other parts of the model check whether it does so correctly. We repeat this for millions of samples, so the AI model is trained thoroughly.
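For readers who like to see the moving parts, here is a heavily simplified sketch of one generator training step, written in PyTorch. Toy linear layers stand in for the real networks, and the architectures, loss formulations, and weights are assumptions for illustration; only the wiring mirrors the description above (recognizer and style extractor feed the generator, and the generated image is judged on spelling, adversarial loss, and perceptual loss).

```python
# Simplified sketch of a single generator update. Real architectures,
# loss formulations, and weights are placeholders; only the data flow
# follows the structure described in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_SIZE = 32 * 128          # flattened grayscale text-line image
MAX_CHARS, VOCAB = 16, 80    # toy character-sequence dimensions
STYLE_DIM, FEAT_DIM = 64, 128

recognizer = nn.Linear(IMG_SIZE, MAX_CHARS * VOCAB)   # image -> character logits
style_extractor = nn.Linear(IMG_SIZE, STYLE_DIM)      # image -> style vector
generator = nn.Linear(MAX_CHARS * VOCAB + STYLE_DIM, IMG_SIZE)
discriminator = nn.Linear(IMG_SIZE, 1)                # real vs. fake score
encoder = nn.Linear(IMG_SIZE, FEAT_DIM)               # perceptual features

opt = torch.optim.Adam(generator.parameters(), lr=2e-4)


def generator_step(target_img: torch.Tensor, char_targets: torch.Tensor) -> float:
    # 1. Describe the target image: what is written, and in which style.
    text_logits = recognizer(target_img)
    style = style_extractor(target_img)

    # 2. Generate a new image from that description.
    fake_img = generator(torch.cat([text_logits, style], dim=1))

    # 3. Judge the generated image.
    rec_logits = recognizer(fake_img).view(-1, VOCAB)               # spelling check
    loss_text = F.cross_entropy(rec_logits, char_targets.view(-1))
    loss_adv = -discriminator(fake_img).mean()                      # fool the discriminator
    loss_perc = F.mse_loss(encoder(fake_img), encoder(target_img))  # stay visually close

    loss = loss_text + loss_adv + loss_perc
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Toy usage with random tensors standing in for real samples.
imgs = torch.rand(8, IMG_SIZE)
chars = torch.randint(0, VOCAB, (8, MAX_CHARS))
print(generator_step(imgs, chars))
```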
Hooray, we've got our solution. Now we can generate handwritten text to train the Klassif.ai platform. The beauty of this solution is that we can extend it: if a client needs to, for example, extract the dates from a document, and those dates are all written differently (in full, shortened, American, British...), we can train our model to recognize these types of inputs.
Would you like to know more? Book a meeting with us!