Kofax Transformation Modules – format locators and dynamic regular expressions

9.1.2013 | 5 minutes reading time

Part 1: An introduction to format locators and regular expressions

Many of our customers use systems for automatic document classification and data extraction. These data capture systems extract metadata from electronic images (the scanned pages of documents, faxes or emails) and release the data and the document to business applications. A core part of these systems is a technique called freeform field extraction. Freeform extraction means that the search for metadata (for example an insurance number) works independently of the document layout.

This is the main principle of freeform extraction: each value that we try to extract has a specific syntactic structure. As an example, the insurance number of an insurance company could have the following structure: YYYY/1234567890 (four digits for the year / up to 10 digits for the number). Examples: 2012/45 or 2011/47123.

This insurance number may be written anywhere on a document. But it never appears without some context, because the customer or the clerk has to be able to identify it too. Therefore you will find words near the number such as “insurance number”, “ins.no.”, “Ins.nbr.”, “contract number”, … There is a spatial relation between the number and its describing text. This text may be written to the left, to the right, above or below the insurance number. Furthermore, the distance between text and number may be used as an attribute for the extraction.

codecentric uses ‘Kofax Transformation Modules’ (KTM) as one product for automatic classification and data extraction. KTM can be integrated as a module into the capture solution Kofax Capture (see Stefan Blank’s blog).

KTM uses internal tools called ‘format locators’ to identify values. Within such a locator, you define the structure of a value (insurance number), the describing text (“insurance number”, “contract number”) and the spatial relation between value and text.

Here is a snippet of an example document with an insurance number (unfortunately a German document):

*** Remark: Versicherungsnummer = insurance number ***

A format locator for the extraction of the insurance number could be defined as follows (screenshots are from the KTM Project Builder):

A so-called regular expression describes the general structure of an insurance number: 20\d{2}/\d{1,10}

Year (four digits) / 1 to 10 digits, for example 2011/47123

The regular expression describes exactly this:

20 matches the first two digits of the year

\d{2} represents exactly two digits

/ represents the character /

\d{1,10} represents a number with 1 to 10 digits
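
The breakdown above can be tried outside of KTM. The following is a minimal sketch in Python (not KTM's scripting language) that applies the same pattern to a line of text. The sample text and the variable names are assumptions made up for this illustration.

```python
import re

# The pattern from the text: "20", two digits, a slash, then 1 to 10 digits.
INSURANCE_NUMBER = re.compile(r"20\d{2}/\d{1,10}")

# Hypothetical OCR text of a page; the wording is invented for this example.
page_text = "Insurance number: 2011/47123 Invoice 2012/0815"

# findall returns every non-overlapping match, scanning from left to right.
matches = INSURANCE_NUMBER.findall(page_text)
print(matches)
```

Both numbers in the sample text have the same structure, so the pattern alone returns both of them.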

The expression 20\d{2}/\d{1,10} will find all matching strings anywhere on the document. Besides the insurance number, these could be other strings that happen to match the regular expression (phone numbers, bank codes, …). To ensure that only the insurance number is selected, the describing words have to be defined within the format locator:

[Screenshot: KTM-FL-EVAL-EN75]

A line in this list means, for example: the term “contract number” must be found to the west (left) of the matching number at a ‘near’ distance. If the term is found there, it scores 100 points. You can add all terms that may describe an insurance number.

By combining the regular expression with the describing terms, KTM is able to read the insurance number from any document and to reject the improper matches, independent of the number’s position on the document. The winner is the match with the highest score (points).

You can test this within the KTM Project Builder by clicking the ‘Test’ button:
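
The idea of scoring candidates by nearby describing terms can be sketched in a few lines of Python. This is a strongly simplified model and not KTM's actual algorithm: it works on a single line of text, only looks to the left of a candidate, and the names `TERMS`, `NEAR` and `best_match` as well as the distance of 30 characters are assumptions for this illustration.

```python
import re

NUMBER = re.compile(r"20\d{2}/\d{1,10}")

# Describing terms and their scores; the values are assumptions for this sketch.
TERMS = {"insurance number": 100, "contract number": 100, "ins.no.": 100}

NEAR = 30  # hypothetical "near" distance, in characters to the left of a match


def best_match(text):
    """Return (value, score) of the highest-scoring candidate, or None."""
    best = None
    for m in NUMBER.finditer(text):
        # Text directly to the left ("west") of the candidate.
        left = text[max(0, m.start() - NEAR):m.start()].lower()
        score = max((pts for term, pts in TERMS.items() if term in left), default=0)
        # Candidates without any describing term nearby are rejected.
        if score > 0 and (best is None or score > best[1]):
            best = (m.group(), score)
    return best


text = "Invoice 2012/0815 of 1 March. Contract number: 2011/47123"
print(best_match(text))
```

In this sample, the invoice number has the right structure but no describing term to its left, so only the number after “Contract number:” is returned.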

In real customer correspondence with an insurance company, the insurance number may be written in several different notations. Instead of 2011/47123 you may find 2011-47123, 2011 47123 or even 201147123. To match these numbers with the format locator, the regular expression is changed slightly in a real environment.

All of the above notations will be found by this regular expression:
20\d{2}.?\d{1,10}

The dot in the middle of the expression represents any single character. The following question mark declares the preceding character as optional. With this definition KTM will find all of these:

2011/47123
2011-47123
2011 47123
201147123
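
The relaxed expression can be checked in the same way. The sketch below, again in Python and not in KTM, tests the four notations mentioned earlier. Because the dot accepts any character, the expression also accepts a string such as 2011x47123. The stricter variant `STRICT`, which only allows a slash, a hyphen or a space as separator, is an assumption added for comparison and not the expression used in the article.

```python
import re

# The relaxed pattern from the text: one optional arbitrary character
# between the year and the number.
RELAXED = re.compile(r"20\d{2}.?\d{1,10}")

# Assumed stricter variant: only "/", "-" or a space as optional separator.
STRICT = re.compile(r"20\d{2}[/\- ]?\d{1,10}")

notations = ["2011/47123", "2011-47123", "2011 47123", "201147123"]

for value in notations:
    print(value, bool(RELAXED.fullmatch(value)), bool(STRICT.fullmatch(value)))

# The dot matches any character, so this string is accepted by RELAXED only.
print(bool(RELAXED.fullmatch("2011x47123")), bool(STRICT.fullmatch("2011x47123")))
```

Whether the extra tolerance of the dot is wanted depends on the OCR quality: a misread separator would still be matched by the relaxed expression.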

In real customer projects the extracted insurance number is checked against the contract database. If the number exists, the number and the document (possibly with other extracted metadata) are routed electronically to the relevant clerk or business application. If the database check is not successful (or no insurance number is found), the document must be validated manually. KTM provides a validation module for this purpose, which can also be integrated into the Kofax Capture workflow.
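
The decision described above can be sketched as follows. This is only an illustration in Python: the set `KNOWN_CONTRACTS` stands in for the contract database, and the function name `route`, the return values and the normalization to the YYYY/number form are assumptions, not part of KTM or Kofax Capture.

```python
import re

# Splits an extracted value into year and number, with an optional
# single non-digit separator in between.
PARTS = re.compile(r"(20\d{2})\D?(\d{1,10})")

# Hypothetical stand-in for the contract database.
KNOWN_CONTRACTS = {"2011/47123", "2012/45"}


def route(extracted):
    """Return 'route' if the number is a known contract, else 'validate'."""
    m = PARTS.fullmatch(extracted or "")
    if m is None:
        return "validate"  # nothing usable was extracted
    # Normalize all notations to the canonical form YYYY/number.
    canonical = f"{m.group(1)}/{m.group(2)}"
    return "route" if canonical in KNOWN_CONTRACTS else "validate"


print(route("2011-47123"))
print(route("2013/1"))
print(route(None))
```

Normalizing before the lookup means that the different notations of the same number all resolve to the same contract.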

Not too long ago, I thought that it was possible to extract all metadata from a document with KTM by using format locators and regular expressions, as long as the document is not handwritten. Recently we had to set up a document classification/extraction project at a scan service provider that works for financial institutions. The challenge was to develop one project that works for several clients. We had to deal with document types for which the described ‘static’ format locators could not deliver sufficient results. We needed a kind of format locator whose regular expression could be modified at runtime (depending on client-specific data). Because KTM provides a VB-compatible scripting language, and with some knowledge of the KTM object model, we were able to master this challenge.

The second part of this blog series will show how you can dynamically change the regular expression of a format locator at runtime by using KTM’s scripting language.

New: article about document classification with KTM

New: KTM and insurance companies: Document Process Automation

Blog author

Jürgen Voss
