Chinese ID Verification: OCR Challenges

Challenges of ID Verification in China: Script Reading

The process of Chinese ID verification is filled with challenges, one of which is the unique script the Chinese language boasts. While it’s convenient that all IDs use Simplified Chinese, as opposed to a range of regional languages, even this version of the language can be hard to process optically.

What difficulties can arise, exactly?

In this article, we will answer this question and break down the OCR (Optical Character Recognition) problems caused by the complexity of the Chinese script.

The current state of OCR for the Chinese script

In one of our previous articles, we underscored how the sheer diversity of ID documents in circulation is a big challenge for Chinese ID verification. And this is not the only challenge: the script used in these documents can be too sophisticated for some automated OCR systems to process correctly. What’s more, the script tends to exacerbate many technical problems that also affect Latin-script documents, such as poor lighting or focus.

Script-specific OCR challenges

Unlike English or other Latin-alphabet languages, Chinese is written in logographic characters that are dense with strokes. A single character can contain a dozen or more distinct strokes, and at the small font sizes used on an ID card those strokes can smear together, making it hard for software to tell similar-looking characters apart.

For example, 贝 (bèi, “shell”), which is used in both surnames and given names, is simple and highly symmetrical. However, OCR can still confuse it with 见 (jiàn, “to see”) if the lower strokes are somewhat faded. 吉 (jí), which is also very common in both surnames and given names, consists of 士 (shì, “scholar”) above 口 (kǒu, “mouth”). If stroke boundaries bleed or contrast is weak, OCR may mistake it for 卡 (kǎ) or 哲 (zhé), which share overlapping parts.
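One way to act on such look-alike pairs is to keep a table of confusable characters and send low-confidence reads to a second check. The sketch below is a minimal illustration, not a description of any particular OCR engine: the confusion sets contain only the examples from this paragraph, and the per-character confidence scores and the 0.9 threshold are assumptions.

```python
# Hypothetical confusion sets built only from the examples in the text above.
# A production table would be far larger and derived from real OCR error data.
CONFUSION_SETS = [
    {"贝", "见"},
    {"吉", "卡", "哲"},
]


def confusable_alternatives(char):
    """Return the other members of every confusion set that contains char."""
    alternatives = set()
    for group in CONFUSION_SETS:
        if char in group:
            alternatives |= group - {char}
    return sorted(alternatives)


def flag_for_review(chars_with_confidence, threshold=0.9):
    """Return (index, char, alternatives) for low-confidence confusable reads.

    chars_with_confidence: list of (char, confidence) pairs, confidence in [0, 1].
    The threshold is an assumed value and would need tuning on real data.
    """
    flagged = []
    for index, (char, confidence) in enumerate(chars_with_confidence):
        alternatives = confusable_alternatives(char)
        if alternatives and confidence < threshold:
            flagged.append((index, char, alternatives))
    return flagged
```

A flagged character could then be re-read from a sharper frame or compared against another data source on the document.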

As for more complicated cases, parents sometimes choose auspicious characters for names, such as 淼 (miǎo), which consists of three 水 (shuǐ, “water”) components and means “vast expanse of water.” An OCR system could mistakenly segment 淼 into multiple characters if it doesn’t recognize the triple-stack pattern, reading it as a sequence of character components (氵氵氵) or misidentifying it as other water-related characters.
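A common guard against this kind of over-segmentation relies on the fact that printed Chinese characters occupy roughly square cells, so a detected box much narrower than the line height is suspicious. The sketch below merges such fragments with their neighbors. The box format, the 0.6 narrowness ratio, and the 1.2 width limit are all assumptions chosen for illustration.

```python
def merge_narrow_boxes(boxes, line_height, min_ratio=0.6, max_ratio=1.2):
    """Merge horizontally adjacent boxes that look like character fragments.

    boxes: list of (x_left, x_right) tuples, sorted left to right.
    A box narrower than min_ratio * line_height counts as a fragment. Two
    neighbors are merged when at least one is a fragment and the merged box
    is no wider than max_ratio * line_height.
    """
    merged = []
    for left, right in boxes:
        if merged:
            prev_left, prev_right = merged[-1]
            prev_narrow = (prev_right - prev_left) < min_ratio * line_height
            cur_narrow = (right - left) < min_ratio * line_height
            combined_fits = (right - prev_left) <= max_ratio * line_height
            if (prev_narrow or cur_narrow) and combined_fits:
                merged[-1] = (prev_left, right)
                continue
        merged.append((left, right))
    return merged
```

Digits and Latin letters are legitimately narrow, so a heuristic like this would only be applied to fields that are expected to contain Chinese characters.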

Another critical issue stems from the fact that some IDs are printed in both Chinese and English. More specifically, Chinese passport data pages show fields in both Chinese characters and English, and even mix scripts in one field (dates are printed as digits with the Chinese character 月 for month).
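A mixed-script field can also work in the reader's favor, because the two scripts carry the same information and can be cross-checked. The sketch below assumes a date printed as `15 3月/MAR 1990` (day, numeric month with 月, English abbreviation, year). That layout is an assumption for illustration, and the function name is hypothetical.

```python
import re
from datetime import date

MONTH_ABBREVIATIONS = [
    "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
    "JUL", "AUG", "SEP", "OCT", "NOV", "DEC",
]

# Assumed layout: day, numeric month followed by 月, slash, English month, year.
DATE_PATTERN = re.compile(
    r"(\d{1,2})\s*(\d{1,2})\s*月\s*/\s*([A-Za-z]{3})\s*(\d{4})"
)


def parse_bilingual_date(text):
    """Parse a mixed-script date, or return None if the two scripts disagree."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month, abbreviation, year = match.groups()
    month = int(month)
    if not 1 <= month <= 12:
        return None
    # The Chinese and English month must agree; a mismatch suggests an OCR error.
    if MONTH_ABBREVIATIONS[month - 1] != abbreviation.upper():
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:
        return None  # e.g. day 31 in a 30-day month
```

Returning `None` on disagreement, rather than trusting one script, leaves the decision about which reading is correct to a later step.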

This is important because, during an ID check, a document reader will attempt to match the name written in Chinese to its Latin-script (pinyin) transliteration found in the machine-readable zone (MRZ). If either the Chinese script or the transliteration is ambiguous, the check may fail: the two-character name “张伟” might be transliterated as “Zhang Wei” or “ZHANGWEI”. That’s why OCR solutions must have context-aware transliteration logic and language-specific matching.
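A minimal sketch of such matching is to romanize the Chinese name and compare it with the Latin string after removing spacing, case, and MRZ filler characters. The four-entry pinyin table below is a stand-in for a full dictionary, and the function names are hypothetical. A real system would also have to handle characters with more than one reading.

```python
import re

# Tiny illustrative pinyin table (tone marks dropped, as in an MRZ).
# A real system would use a complete dictionary.
PINYIN = {"张": "ZHANG", "伟": "WEI", "王": "WANG", "芳": "FANG"}


def normalize_latin(name):
    """Uppercase and keep only A-Z, so spacing and '<' fillers are ignored."""
    return re.sub(r"[^A-Z]", "", name.upper())


def romanize(chinese_name):
    """Return the concatenated pinyin, or None if a character is unknown."""
    syllables = [PINYIN.get(char) for char in chinese_name]
    if None in syllables:
        return None
    return "".join(syllables)


def names_match(chinese_name, latin_name):
    """Compare a Chinese name with a Latin-script rendering of it."""
    expected = romanize(chinese_name)
    return expected is not None and expected == normalize_latin(latin_name)
```

With this normalization, “Zhang Wei”, “ZHANGWEI”, and the MRZ form “ZHANG<<WEI” all compare equal to 张伟. The match runs from Chinese to Latin because many different characters share the same pinyin, so the reverse lookup is not unique.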

During MRZ reading in Chinese passports, IDV software must convert the Latin script into the original Chinese name and match it against the visual data.

What’s more, national identity cards issued in certain autonomous regions add a second language (naturally, not English this time): for example, Guangxi ID cards include Zhuang script, and Xinjiang IDs may include Uyghur (Arabic script).

ID cards for ethnic Mongolians (left) and Uyghurs (right) issued by Chinese authorities display data in both the holder’s native language and Chinese characters.

Technical OCR challenges (compounded by the script)

Regardless of language, the image capture conditions and document design still play a massive role in OCR accuracy. For Chinese ID verification, this is especially true because of the fine details.

Lighting is a common issue, as harsh reflections or shadows can easily ruin text visibility. Chinese IDs and licenses are often laminated or coated, so overhead lights are readily reflected by the glossy surface. On the other hand, low-light conditions introduce noise and require longer exposure, often yielding blurry images if the hand isn’t perfectly steady.

Lighting is also often an underlying reason for another problem: security features interfering with OCR. The Chinese driver’s license, for example, is highly glossy with vibrant holograms, and even slight tilting causes bright reflections that OCR may interpret as light patches or random shapes across text. Similarly, ghost images, guilloché background patterns, microprinted text, or UV markings can all reduce the contrast or add clutter for the OCR software.
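Glare of this kind can be caught before OCR runs by measuring how much of the frame is close to pure white. The sketch below is one simple way to do that on an 8-bit grayscale image stored as a list of rows. The 245 brightness threshold and the 2% limit are assumed values that would need tuning per device.

```python
def glare_ratio(gray_image, threshold=245):
    """Fraction of pixels at or above threshold in an 8-bit grayscale image."""
    total = sum(len(row) for row in gray_image)
    if total == 0:
        return 0.0
    bright = sum(1 for row in gray_image for pixel in row if pixel >= threshold)
    return bright / total


def has_glare(gray_image, max_ratio=0.02, threshold=245):
    """Return True when the share of near-saturated pixels exceeds max_ratio."""
    return glare_ratio(gray_image, threshold) > max_ratio
```

A capture flow could use a check like this to ask the user to tilt the document and retake the photo instead of passing a washed-out frame to the recognizer.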

This is why high-end document readers like the Regula 72X3 use multiple light sources (visible, infrared, ultraviolet) and take multiple images, which software then analyzes. However, for a mobile OCR solution, you may have to rely on a single RGB image, so it’s all about optimizing how that image is captured to minimize security feature interference.

An ID document must also be framed correctly and remain in focus, as a blurry or angled image can cause the OCR engine to misread strokes: for example, 千 (qiān) can be read as 干 (gàn) or 于 (yú).
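Focus can be screened with a standard sharpness measure: the variance of the image Laplacian, which drops when edges are blurred. The pure-Python sketch below shows the idea on a grayscale image stored as a list of rows. It is an illustration rather than a tuned implementation, and any acceptance threshold would have to be calibrated on real captures.

```python
def laplacian_variance(gray_image):
    """Variance of the 4-neighbor Laplacian over the interior pixels.

    Higher values indicate more sharp edges; values near zero indicate a
    flat or blurred image.
    """
    height = len(gray_image)
    width = len(gray_image[0]) if height else 0
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray_image[y - 1][x] + gray_image[y + 1][x]
                + gray_image[y][x - 1] + gray_image[y][x + 1]
                - 4 * gray_image[y][x]
            )
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)
```

Frames that score below the calibrated threshold can be rejected at capture time, before thin strokes are lost to blur.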

Last but not least, the physical condition of an ID is a factor. IDs can be scratched, scuffed, faded, or stained, and any such damage will impact OCR. In the case of Chinese, a scratch across a character can easily erase a stroke. Moreover, dirt or smudges can look like extra strokes, potentially causing false readings.

Meeting Chinese ID verification challenges with Regula

Given the above challenges, it is hard to find a solution that will work flawlessly under any conditions, especially in the case of Chinese. The best results will come from combining an advanced OCR engine (with support for Chinese) with an extensive document template library that helps the engine interpret the fields correctly.
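To illustrate how a template can help an OCR engine, the sketch below describes one document type as a set of named fields, each with a region on the page and a pattern that its text must match. Everything here is hypothetical: the field names, coordinates, and patterns are invented for the example and do not reflect Regula’s template format or any real document layout.

```python
import re

# Hypothetical template: each region is (left, top, right, bottom) as fractions
# of the page size, and each pattern describes the text expected in that field.
TEMPLATE = {
    "name": {
        "region": (0.18, 0.10, 0.60, 0.22),
        "pattern": r"[\u4e00-\u9fff]{2,}",
    },
    "birth_year": {
        "region": (0.18, 0.36, 0.34, 0.46),
        "pattern": r"(19|20)\d{2}",
    },
}


def region_to_pixels(region, width, height):
    """Convert a fractional region into pixel coordinates for cropping."""
    left, top, right, bottom = region
    return (
        round(left * width), round(top * height),
        round(right * width), round(bottom * height),
    )


def validate_fields(template, ocr_results):
    """Return {field: True/False} depending on whether the text fits the pattern."""
    return {
        field: re.fullmatch(spec["pattern"], ocr_results.get(field, "")) is not None
        for field, spec in template.items()
    }
```

Knowing in advance where a field sits and which characters it may contain lets the reader crop tightly and reject implausible results, such as a digit inside a Chinese name.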

Regula provides both parts of the solution: Regula Document Reader SDK supports over 138 languages (including Chinese) and more than 600 data fields, while our template database is the biggest in the world, with 16,000 documents from 254 countries and territories.

In addition, the SDK supports full UI localization for 35 languages (including Chinese-language interfaces), which helps local deployments.

With Regula Document Reader SDK, you will be able to:

Authenticate thousands of ID documents from all over the world, including China.
Read machine-readable zones (MRZs) and barcodes.
Read and authenticate RFID chips.
Verify digital signatures encrypted into barcodes using the ICAO Datastructure format.
Verify dynamic security features, including holograms and optically variable ink (OVI).
And more.

Let’s drive the future—together. Book a call to learn more about our solutions!

Verify IDs in seconds with Regula SDK

Instantly verify passports, ID cards, driver’s licenses, and more—powered by the world’s largest database of document templates.
