Language

25 Jan 20237 min readin ID verification & biometrics

Document Parsing vs. OCR: What’s the Most Effective Way to Process ID Documents

Ihar Kliashchou

Chief Technology Officer, Regula

Imagine a bank clerk and a long line of customers who want to open an account. Not a complicated procedure, but time-consuming because of identity-related paperwork. The clerks need to pull data out of a document, make sure it’s valid, and enter it into their system. Doing it manually isn’t an option as it will take ages.

The first thing that comes to mind is using OCR technology to automate the task. Still, the shortest path isn’t the fastest one. In this article, we’ll consider different approaches to capturing data from IDs and describe the most effective way to do it.

Stay Tuned

We'll deliver hand-picked content from Regula's experts into your inbox

What is OCR and why is it of little help when it comes to IDs?

Optical character recognition, or OCR for short, is a technology that can turn an image of text into actual text which you can further edit. For example, a scanned passport is an image, so you can’t just press Ctrl+C and Ctrl+V to copy the details and paste them somewhere else. OCR distinguishes text characters within images and converts them into text format, thus helping you avoid manual data entry.

A great example of OCR is the Live Text feature available with recent Apple devices. It allows you to fetch text from photos and images with a tap, which comes in handy when you need to extract a phone number or just add something to your notes.

a real-world example of using OCR in mobile devices

A really convenient feature, though the resulting text requires a bit of editing

However, what might be a solution for a single task won’t always help you cope with the same task at scale. OCR turns images into machine-readable text, and that’s it. If you have a lot of documents (especially if those documents contain various data) you’ll likely end up with a messy heap of text that’s hardly actionable without some prior brush-up.

Well, OCR providers didn’t ignore this problem and offered a solution.

OCR templating

Many OCR providers offer a feature known as OCR templating. OCR templating allows you to create document maps, or templates, using a set of your most common documents as a foundation. With these templates, the computer will know where important elements are located on the page. For example, say in an invoice, the date is always in the upper right corner, and the total amount appears at the bottom.

OCR templating example

The OCR engine does not have the intelligence to identify the specifics; it treats data as characters

OCR templating is often used in robotic process automation (RPA) solutions. RPA is a technology that allows you to automate repetitive tasks that follow the same rules over and over again with the help of software bots. If you were to build a bot for paying the same invoices every week, OCR templating would be a perfect candidate for that kind of task.

With OCR templating you get more actionable output than with ordinary OCR because it allows you to mark up and pull out information in a more structured way. However, there are a couple of buts to consider when you work with identity documents.

Amount of manual work. To make OCR templating work, you need to manually mark up the data in every document type you need. So, if you need to process passports from different countries, you’ll have to mark up a passport from each country to create its template. Add other identity documents, such as ID cards, to this, and you’ll have an even larger amount of painstaking labor.

The need to maintain templates. The created template only works as long as the document itself doesn’t change. If there are new fields, or they are placed in a new location, you need to update the template.

No verification capabilities. Worse still, the challenge with ID documents isn’t just getting some data from them. It's about making sure the data and the document are valid. No OCR solution, which simply recognizes text, can help you with that.

OCR cannot read non-text data sources. ID documents contain information not only in text format, but also encrypted in barcodes, RFID chips, and machine-readable zones (MRZs). For conventional OCR tools, it’s impossible to read and verify these elements.

Document parsing: The most effective way to extract and verify data from identity documents

As we said above, the thing with ID document processing is that conversion from an image to text isn’t the only challenge you need to solve when trying to automate your workflows. Further, we’ll dive into how you can get all kinds of data from IDs (not only text fields), automatically structure it, and enable accurate data queries.

To illustrate the point, we’ll use the data parsing capabilities of Regula’s document parsing software, which is purpose-built for reading identity documents.

How does ID parsing work anyway?

The point of ID parsing is that we don't just return raw user data, as an OCR reader would do. Instead, document parsing software also structures and additionally analyzes it. Generally, the process of document parsing consists of five steps. Let’s have a look at it in comparison with OCR.

OCR softwareDocument parsing software
  • Scanning a document;
  • * If OCR templating is supported: Choosing the type of document from a set of manually pre-made templates;
  • * If OCR templating is supported: Reading the fields that are defined by the template;
  • Delivery of the output.
  • Scanning a document;
  • Automatically identifying its type by comparing the document against a database of document templates;
  • Reading and validating the fields that are defined by the template;
  • Structuring the output;
  • Document verification.

 

While the first three steps of the document parsing process somewhat resemble the principles of OCR templating, there can be major differences, depending on who created the document templates, the number of templates, and how well they are done.

When using an in-house OCR solution, the number of templates is usually limited to the few most common ones. In contrast, Regula’s document parsing software leverages the world’s largest document templates database, which currently includes over 13,000 templates of passports, ID cards, visas, driver’s licenses, and other documents from all over the world. It saves you a tremendous amount of time, as you don’t need to create any ID document templates on your own. But it’s not only about saving time.

The depth of template detail is another point in favor of using specialized data parsing software. To create a reliable template, you need to have information about all the possible variations for each field in the document. This isn’t something you can do with just a couple of samples at hand.

For example, the expiration date on ID cards is usually written as a date. In some countries, like Bulgaria or Vietnam, there are the words “No expiration date” (or words to that effect in the respective language) for people over a certain age. If you don't know these peculiarities, the template becomes useless.

Before a new template gets in the database, Regula’s in-house forensic experts scrutinize it and report every security feature using advanced proprietary equipment. Thanks to this, you can be sure that the automated check matches the quality of a lab examination (but much faster). Even if your customer submits a document you’ve never seen and in a language you don’t speak, Regula will be able to recognize it in a moment and tell you what it is and what its characteristics are.

Regula Document Reader SDK

Automate identity document reading and authentication.

Learn more

Can document parsing software verify documents?

It depends on the level of analysis depth you need, but the short answer is: yes, it can.

The Regula data parser starts with lexical analysis and validation that every field in the document says exactly what it should say. It checks if the expiration dates are valid, and flags, for example, if it’s 2022 but the document expired in 2021.

The lexical analysis also includes mask violations: say we expect a field to contain a date, but there’s no date or the field has another value. There is also an analysis for stop words: the provided documents shouldn’t have words such as “sample,” “specimen” or “test.” All this happens automatically and is indicated in the field statuses.

Lexical analysis detected the word “specimen” in a passport

Lexical analysis detected the word “specimen” and marked these fields as invalid

Also, identity documents can have four types of data sources: visual inspection area, MRZ, RFID (radio frequency identification) chip, and barcodes. The data in different sources is often duplicated.

Unlike an OCR solution, Regula’s data parsing software reads all the sources and automatically compares all similar fields. For example, it can take a person’s last name from the RFID chip and compare it to the last name written in the MRZ and the one in the visual inspection zone. If anything doesn’t match, Regula will mark this field as invalid. So, if someone altered their name in the visual inspection zone (relatively easy to do) but failed to update the chip (a way harder thing to do), it’ll be detected.

Another example of verification that became in demand recently is decoding Visible Digital Seals (VDSs). A typical VDS looks like a QR code. For example, it was used for issuing Covid-19 vaccination certificates. It’s also used for visas: starting from May 2022, all Schengen visas have VDSs. Regula can effectively analyze the digital signature that the VDS barcode contains and verify that it was signed with a specific certificate issued by a specific country, not just randomly generated.

A tricky aspect of reading barcodes and MRZs

There’s one thing in common between barcodes and MRZs—their diversity. If your solution lacks the knowledge of at least a few of their possible variations, it’ll lead to an influx of false positives and eventually undermine trust in the check.

There’s a huge variety of barcodes, and each type requires its own parser for processing. Regula uses 220 different barcode parsers built for specific documents to handle this task. The main difficulty with this is that you need to understand what format is used to encode the data to be able to distill it down to text fields.

For example, USA and Canada driver’s licenses often contain a PDF417 barcode, but you can’t just read it without knowing what format was used for encoding. In the case of this barcode, it has a header that says what parser type to use. Regula gets this information and then uses this parser to decipher the rest of the barcode body into specific text fields marked with their types for the end user.

But that’s not all: AAMVA, the American standard for encoding, also has numerous versions that have been used to fill in documents over time. This fact must be taken into account if you want not only to read encoded information but also verify it.

example of reading and verification of the PDF417 barcode in a driver’s license

A PDF417 barcode, when properly handled, contains plenty of information for verification purposes

The same goes for the MRZ. There’s a conventional MRZ specified by the ICAO 9303 standard. In theory, everyone should adhere to it, but the reality is that many countries bring their own innovations. Romania, Kuwait, and the UAE, for instance, count checksums differently, thus creating a fork for the standard MRZ. With the most comprehensive document template database, Regula effectively handles these nuances too.

Structuring data makes it actionable

The highly structured output is one of the biggest pros of applying ID parsing for processing ID documents. When working with documents at scale, the last thing you need is a pile of siloed personal data to manually sort and label before you can further use it.

With Regula, all the data it reads and analyzes is divided into groups and fields. Each field is assigned a type. Thanks to this, users can scan a document and instantly pull out the specific information they need: for example, they can request only the full name.

It is worth mentioning that there can be several fields of the same type in one document, but in different languages (say, Latin characters and a national language). Regula allows you to retrieve the data regardless of what kind of document it is and where exactly it is located in the document, and in the language that is relevant for your purposes.

When we say “data,” we don’t necessarily mean text data; it also refers to all relevant graphical images. This includes the image of the document itself and cropped-out visual elements, such as the portrait photo, ghost portrait, and signature. Regula saves each element separately, so it’s ready for further use. For example, you can use the portrait extracted at this step to conduct face matching checks.

cropped out visual elements of a Slovakia ePassport

Not only text fields but also visual elements of IDs are quickly accessible with document parsing

Structuring is also about converting data in different measurement systems (metrical/imperial) and date formats (yyyy/dd/mm, dd.mm.yyyy) into a unified format set by the user. This allows you to provide values that users are familiar with (say, converting a Thai year to the Gregorian calendar) and immediately compare apples to apples at verification checks without any workarounds.

The main idea behind ID parsing is to quickly deliver ready-to-use results. You quickly get the analysis, make sure the document is authentic, quickly fetch information from certain fields, and quickly scan and digitize the document to fill out a form in your internal system.

Not only is it convenient, as it speeds up the workflow, but also informative and secure. Having provided an executive summary of the analysis, ID parsing allows you to dig deeper into each check down to the raw data.

The bottom line

Data can hardly be used in its raw state. Once collected, it needs to be broken down and analyzed to have value and eventually turn into decisions. While OCR is a great technology that has revolutionized data collection, it’s no longer enough to deal with identity documents effectively.

That’s where ID parsing comes in. When backed up with solid expertise in protected document forensics, as in the solution Regula provides, it helps you solve most challenges with identity document processing.

Key takeaways

  • OCR is a technology that can turn an image of text into actual text which you can further edit. The output is usually unstructured data.
  • OCR templating allows you to manually map elements in existing documents to turn them into templates and then pull out the text from similar documents from the specified areas.
  • The challenge of getting the data from identity documents is that it comes not only in text format but is also encoded in machine-readable zones, chips, and barcodes. That’s why you need more than OCR to tackle this task.
  • The point of identity document parsing is to return structured, analyzed, and verified data that is ready for further usage.

  • Make sure the provider you choose for ID parsing has proven extensive expertise in identity document forensics.

On our website, we use cookies to collect technical information. In particular, we process the IP address of your location to personalize the content of the site

Cookie Policy rules