TL;DR: We had a real headache building a named entity annotation tool for our Intelligent Document Processing platform, Klassif.ai. After three iterations, we finally built an annotator that ticked all the boxes. And the best part? We made that puppy open source!
Klassif.ai is an intelligent document processing platform. In this blog post, we will discuss what our journey toward our own annotator looked like: where we ran into closed doors and how we found ways to open them. Bear with us!
When we first built the Klassif.ai platform, one of the biggest challenges besides the AI was the actual annotator. For those who are wondering: an annotator is the part of the application where a user highlights, or annotates, certain entities. Named entities are predefined categories like a date, a reference or whatever... Basically, the stuff our AI needs to find in PDFs.
As you probably know, the web and PDFs aren’t the best of friends… Add interacting with those PDFs and you are giving yourself a headache! 🤯
Way back in Klassif.ai's pre-alpha days, our approach was to extract the text from PDFs with a Python backend service. Then we would simply build an annotator on top of the plain text we extracted. This wasn’t too hard to build and the result was surprisingly passable.
"Problem solved!” Or so we thought…
While we were building the platform, we conducted some user tests and had people try the initial version. Feedback is key, right? One of the first pieces of feedback we received was that most users had trouble finding information in the plain text version. Why? They were only familiar with the layout of the original PDFs. By just glancing at a specific PDF, they used to know where to find the information they needed. In conclusion: not the ideal solution.
Klassif.ai wants to facilitate manual document processing, not hinder it! So, back to the drawing board and, of course, back to the headaches. 🧐
Okay, so, what exactly is the problem here? You just want to show a PDF in a browser, right? That doesn’t sound too hard… I mean, PDF viewers for browsers have existed for quite some time now.
However, what we needed was not as simple as that. We don’t just want to show a PDF; we also need to interact with it and annotate the information in it.
In other words, we want to know the position of words in the text and visually add highlights when a user selects a certain part. To that end, we need the PDF in a format we can work with and, as we are working with React (which is JavaScript), that means JSON. However, a PDF is about as far from JSON as a format can get. Then again, I bet we are not the only ones in the world who want to convert a PDF to some other format like JSON or HTML… I even bet the open source community has my back, so there is no need to reinvent the wheel. Time to consult the elders of Google!
One quick Google search later, Mozilla's PDF.js framework appears, and it looks promising. PDF.js can turn a PDF into HTML. Not quite what we need (which is JSON), but it’s a start… A bit of research into PDF.js later, though, and it seems like a dead end. It’s perfect for displaying PDFs as HTML, but we need more.
So, I’m sorry, PDF.js, but you didn’t meet the requirements. This is where we went our separate ways. However, at this point, I didn’t know PDF.js would be back to claim its revenge…
My search continued. That's when I stumbled upon Poppler, which has a PDF-to-XML function. It’s not quite what I want and it doesn’t work with JavaScript either... Luckily, it has Python bindings, and Python has a package called xmltodict, which turns XML into a Python dictionary. Which, to summarize, is basically JSON. It’s certainly not the perfect solution, but our problem seems to be very niche and beggars can’t be choosers… So, I guess we are going forward with this solution.
How does this work? Well, we use Poppler in Python to turn the PDF into XML. This process operates on the text layer: it takes the text layer from the PDF, goes over the paragraphs, and translates each paragraph's position into a top-left coordinate. Finally, it adds the font specs (font family, size, weight,...), giving you an XML that can look like this:
<?xml version='1.0' encoding='UTF-8'?>
<pdf2xml producer='poppler' version='0.76.1'>
  <page number='1' position='absolute' top='0' left='0' height='1262' width='892'>
    <fontspec id='0' size='11' family='Arial,Bold' color='#000000'/>
    <fontspec id='1' size='10' family='Arial' color='#000000'/>
    <fontspec id='2' size='39' family='BCC393to1Narrow' color='#000000'/>
    <fontspec id='3' size='10' family='Arial,Bold' color='#000000'/>
    <fontspec id='4' size='12' family='Arial' color='#000000'/>
    <fontspec id='5' size='12' family='Arial,Bold' color='#000000'/>
    <image top='50' left='82' width='240' height='126' src='PO-UAN-00007372-1_1.png'/>
    <text top='54' left='566' width='273' height='12' font='0'><b>KATOEN NATIE COMMODITIES ANTWERP</b></text>
    ...
</pdf2xml>
Now, the last step is to take this XML and use the xmltodict package to convert it to a Python dictionary, which we can then save as JSON in our database. In addition to the text, Poppler also extracts tables, images, figures, etc.
Using these background images and the JSON, we can visualize the PDF in our React application very closely to how the PDF actually looks. It’s not a perfect one-to-one match with the real deal, but it does the trick.
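To give an idea of what that boils down to, here is a minimal sketch of such a page component. It assumes the Poppler JSON has been massaged into a flat list of text blocks; the component name, props and field names are illustrative, not our actual production code.

import React from "react";

// Minimal sketch: absolutely position every extracted text block on top of the
// page's background image, using the coordinates Poppler gave us.
type TextBlock = { top: number; left: number; width: number; height: number; text: string };

export function PdfPage(props: {
  backgroundSrc: string; // e.g. the PNG Poppler extracted for this page
  pageWidth: number;     // page dimensions from the XML
  pageHeight: number;
  blocks: TextBlock[];
}) {
  return (
    <div style={{ position: "relative", width: props.pageWidth, height: props.pageHeight }}>
      <img src={props.backgroundSrc} alt="" style={{ position: "absolute", top: 0, left: 0 }} />
      {props.blocks.map((b, i) => (
        <span
          key={i}
          style={{ position: "absolute", top: b.top, left: b.left, width: b.width, height: b.height }}
        >
          {b.text}
        </span>
      ))}
    </div>
  );
}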
Great! We can visualize our PDF in our frontend. On to the next challenge: annotating the text…
We couldn’t just reuse the logic from our text-based annotator, which relied on text selection, because of how entities appear in real documents. For example, an address is often written on two lines: the first line contains the street and number, and the line beneath it contains the postal code and city.
The problem is that if you have an entity ‘location’ and you want to annotate it using text selection, you simply can’t, because browser text selection is contiguous: between the first line of the location and the second, a whole bunch of other text gets selected too. Our solution to this problem was annotation mode. When you select an entity, boxes appear around the words. When you click on a word, it is marked to be annotated as the currently chosen entity. When you press Enter, you confirm the selection and it is annotated.
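A stripped-down sketch of that flow could look like the component below; the Word type, props and styling are purely illustrative.

import React, { useState } from "react";

// Sketch of "annotation mode": click words to toggle them, press Enter to
// confirm the current selection as one annotation.
type Word = { id: number; text: string };

export function AnnotationMode(props: {
  words: Word[];
  onAnnotate: (wordIds: number[]) => void; // called when the user confirms
}) {
  const [selected, setSelected] = useState<Set<number>>(new Set());

  const toggle = (id: number) =>
    setSelected(prev => {
      const next = new Set(prev);
      if (next.has(id)) {
        next.delete(id);
      } else {
        next.add(id);
      }
      return next;
    });

  return (
    <div
      tabIndex={0}
      onKeyDown={e => {
        if (e.key === "Enter" && selected.size > 0) {
          props.onAnnotate([...selected]);
          setSelected(new Set());
        }
      }}
    >
      {props.words.map(w => (
        <span
          key={w.id}
          onClick={() => toggle(w.id)}
          style={{ outline: selected.has(w.id) ? "2px solid orange" : "1px dashed #ccc", margin: 2 }}
        >
          {w.text}
        </span>
      ))}
    </div>
  );
}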
Perfect! We did it! Hooray... We have the perfect annotation tool. Except we don’t… There are still a lot of drawbacks: you can’t select text, because that messes with annotation mode. You can’t zoom in or out on the PDF, because our background images are static. Worst of all, some entities require a lot of annotating, which means clicking on a lot of words, which is very annoying... And with large PDFs the tool is extremely slow: if you have to annotate more than 5 pages of text, it is nearly unusable.
Right now, the new annotator has been in production for a couple of months and we don’t get too many complaints. It has its limitations, but our users can work around those or aren’t bothered too much by them.
Except for one thing… Not being able to select and copy text from the PDF bothers our users a lot. We received this feedback often, and as the customer is king, a ticket appeared on the backlog: Klassif.ai needs a PDF viewer that supports copying text. Fine, we’ll add a more native-feeling PDF viewer! You can probably see the tool we used coming from miles away. Eventually, we came crawling back, begging for help from PDF.js.
I get to play around a bit more with PDF.js and, at this point, I even start to see its real power... I might have made a mistake thinking it wasn’t up to the job.
You see, PDF.js renders the PDF to a canvas, which means it is a perfect representation of the PDF. The text layer is taken from the PDF and rendered as a best-effort hidden HTML text layer over the canvas. It turns out this text layer is stored internally, as you could probably guess, as a JSON object. Well, do I feel stupid now 🤦🏻‍♂️…
Time to turn stupidity into opportunity! Let’s see if we can use this to build an even better annotator. To the Batcave! I mean: the drawing board... First, let's take a good look at PDF.js's documentation. Errr, right, one of the reasons I wrote it off so quickly is that there is barely any documentation. Well, at least it’s open source, so I guess I’ll venture into the depths of the source code!
Firstly, let’s start by having a look at what the JSON for the text layer looks like:
{
  "dir": "ltr",
  "fontName": "g_d0_f1",
  "height": 10.08,
  "str": "Brainjar nv ",
  "transform": [10.08, 0, 0, 10.08, 352.56780000000003, 706.0364999999999],
  "width": 48.498912
}
Looks very similar to Poppler. We have a fontName, which refers to an object in the styles array. We know the height and the width of the text and, of course, we get the text itself. There is also an array called transform with some random-looking numbers. A quick look in the documentation and we know… Oh, yeah right, never mind. A deep dive into the code it is.
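For anyone wondering where that object comes from: this is roughly how you ask PDF.js for it yourself, using the pdfjs-dist npm package (the document URL is just a placeholder):

import * as pdfjsLib from "pdfjs-dist";

// Rough sketch: load a PDF and pull the text layer items for page 1.
// In a real app you also have to point pdfjsLib.GlobalWorkerOptions.workerSrc
// at the PDF.js worker bundle.
const doc = await pdfjsLib.getDocument("/example.pdf").promise;
const page = await doc.getPage(1);
const textContent = await page.getTextContent();

// textContent.items is an array of objects shaped like the one above;
// textContent.styles maps each fontName to its font details.
console.log(textContent.items[0]);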
PDF.js is written in plain JS, but there is a types package. In that types package I found the following interface:
interface TextContentItem {
  str: string;
  transform: number[]; // [0..5] 4=x, 5=y
  width: number;
  height: number;
  dir: string; // Left-to-right (ltr), etc
  fontName: string; // A lookup into the styles map of the owning TextContent
}
Great! We've got a couple of comments. The transform array is still a bit mysterious, but at least we know that the element at index 4 is the X position and the one at index 5 the Y position. That's perfect, I have everything I need. Now I just need to render the text with all these values and... boom! I have a perfect text layer, just like their demo.
Hold up. My text layer doesn’t seem to fit. Actually, everything is off… Why?
When inspecting the demo of PDF.js and looking at the elements they generate for the text layer, it becomes clear why.
<span style="left: 134.193px; top: 131.593px; font-size: 29.888px; font-family: sans-serif; transform: scaleX(0.970009);">Trace-based Just-in-Time Type Specialization for Dynamic</span>
Look at the transform property: they scale on the X axis... Turns out my code doesn’t do that. But where do they pull that magic number 0.97 from? After some searching in the code, I found this piece:
const { width } = this._layoutTextCtx.measureText(textDiv.textContent);
if (width > 0) {
  textDivProperties.scale = textDivProperties.canvasWidth / width;
  transform = `scaleX(${textDivProperties.scale})`;
}
Bingo! So, just measure the rendered text width and divide the expected canvas width by it. With that added, the text fits correctly.
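In our own text layer, the same trick boils down to something like this (the function and its parameters are illustrative, not code lifted from PDF.js):

// Sketch: stretch or shrink an absolutely positioned span so its rendered width
// matches the width the PDF says the text should occupy.
function fitSpanToWidth(
  span: HTMLSpanElement,
  targetWidth: number, // the width of the item on the canvas
  fontSize: number,
  fontFamily: string
) {
  const ctx = document.createElement("canvas").getContext("2d");
  if (!ctx) return;
  ctx.font = `${fontSize}px ${fontFamily}`;
  const measured = ctx.measureText(span.textContent ?? "").width;
  if (measured > 0) {
    span.style.transform = `scaleX(${targetWidth / measured})`;
    span.style.transformOrigin = "0 0";
  }
}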
This could actually work! Time to turn this bad boy into an annotator that addresses all the issues the current one has. First, I have to find a better method of annotating. Text selection is still a no-go, but there has to be a better way than clicking words. Eventually, we came up with a selection box... It’s brilliant. Actually, it’s simple, yet elegant! Just like taking a screenshot, whatever the user selects gets annotated.
This feels way more intuitive, simpler and more efficient. In addition, we kept the ability to just click a word for single-word annotations, which still makes sense when you only need one word, rather than having to drag a selection.
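Under the hood, the selection box is little more than a rectangle-intersection test over the word boxes we already have. A sketch, with illustrative types:

// Sketch: given the rectangle the user dragged and the bounding boxes of the
// words on the page, return the words that should be annotated.
type Rect = { x: number; y: number; width: number; height: number };
type WordBox = { id: number; rect: Rect };

function intersects(a: Rect, b: Rect): boolean {
  return (
    a.x < b.x + b.width &&
    a.x + a.width > b.x &&
    a.y < b.y + b.height &&
    a.y + a.height > b.y
  );
}

function wordsInSelection(selection: Rect, words: WordBox[]): WordBox[] {
  return words.filter(w => intersects(selection, w.rect));
}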
The other big issue is performance. The old annotator just can’t handle PDFs with lots of pages. This is actually easy to fix, even in the old annotator. The big problem is that we are rendering ALL the pages, which is totally useless. Even if you zoom out very far, you will probably never see more than 3 pages at once on your screen (realistically, just two). So, rendering only what the user sees fixes this problem. And since I am completely rewriting the code anyway, I am adding this to the new annotator as well.
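One way to do that (not necessarily exactly how we did it) is to keep a cheap placeholder per page and only mount the heavy page content once it scrolls near the viewport, for example with an IntersectionObserver:

import React, { useEffect, useRef, useState } from "react";

// Sketch: render the expensive page content (canvas + text layer) only while
// the page is roughly on screen; otherwise render an empty placeholder of the
// same height so the scrollbar stays stable.
export function LazyPage(props: { height: number; children: React.ReactNode }) {
  const ref = useRef<HTMLDivElement>(null);
  const [visible, setVisible] = useState(false);

  useEffect(() => {
    const el = ref.current;
    if (!el) return;
    const observer = new IntersectionObserver(
      entries => setVisible(entries[0].isIntersecting),
      { rootMargin: "200% 0px" } // start rendering a bit before the page is visible
    );
    observer.observe(el);
    return () => observer.disconnect();
  }, []);

  return (
    <div ref={ref} style={{ height: props.height }}>
      {visible ? props.children : null}
    </div>
  );
}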
The last problem was the inability to zoom. Well, PDF.js has support for that, so we solved that issue for free by switching from Poppler to PDF.js! Well, not totally for free: a bit of logic was necessary to recalculate the position and size of annotations, but that was pretty easy.
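That recalculation is just scaling the stored coordinates along with the zoom factor. Roughly (storing annotations at scale 1.0 is an assumption for this sketch):

// Sketch: annotations are stored at scale 1.0; when the user zooms, their
// position and size only need to be multiplied by the current zoom factor.
type Rect = { x: number; y: number; width: number; height: number };

function scaleAnnotation(rect: Rect, zoom: number): Rect {
  return {
    x: rect.x * zoom,
    y: rect.y * zoom,
    width: rect.width * zoom,
    height: rect.height * zoom,
  };
}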
You know what, this new annotator is amazing! However, it could be even more amazing... The old annotator only worked if a PDF had a text layer. Scanned documents? Yeah, sorry, we couldn’t handle those. Sounds bad for a document automation tool, right?
Python has bindings for Tesseract, an open source OCR project, but I just managed to ditch Poppler, which was Python, and I don’t want to go back to a Python dependency. However, JavaScript has a big community... Surely someone has made a Tesseract port for JS, right? And of course, it turns out someone has! Awesome! I love the open source community.
I’m not going to go too much in depth on the integration of Tesseract.js into the annotator. Basically, Tesseract.js can take an image, a canvas or something alike as input and run OCR on it. It then gives you JSON as output; we just transform that output to look like the JSON PDF.js uses for its text layer and, badabing badaboom, we have a text layer for PDFs without a text layer.
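For the curious, the gist looks roughly like this. It uses tesseract.js's recognize API, and the mapping to a PDF.js-style item is deliberately simplified (the exact output fields can differ between tesseract.js versions):

import Tesseract from "tesseract.js";

// Sketch: OCR a rendered page canvas and reshape the result into items that
// resemble PDF.js text layer items. The field mapping is illustrative.
async function ocrPage(canvas: HTMLCanvasElement) {
  const { data } = await Tesseract.recognize(canvas, "eng");

  return data.words.map(word => ({
    str: word.text,
    // bbox coordinates are in pixels, with (x0, y0) the top-left corner
    width: word.bbox.x1 - word.bbox.x0,
    height: word.bbox.y1 - word.bbox.y0,
    transform: [1, 0, 0, 1, word.bbox.x0, word.bbox.y0],
    dir: "ltr",
    fontName: "ocr",
  }));
}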
This blog post is already getting pretty lengthy, so I’ll keep the conclusion short. Making a PDF entity annotation tool for the web wasn’t easy. There were a lot of issues and lots of lessons learned. In addition, making this annotator would not have been possible without all the awesome open source projects available on the web!
Because we, at Klassif.ai, want to give back to that awesome community (and secretly also because we are very proud of this piece of software), we have decided to completely open source our annotator component. So, if you are looking for an awesome PDF annotator: we’ve got your back!
Go ahead, it’s free! Enjoy it!
P.S. Feel free to contribute or give feature suggestions for our annotator!