In contrast to form-based recognition, free-form recognition tries to find certain values (such as an insurance number) anywhere on a document. It helps if the value being searched for has a structure that can be described with a regular expression. Key words that appear ‘near’ the value (for example ‘insurance number’, ‘ins nbr’, …) are often used to narrow the search as well.
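As a rough illustration of the key word idea, the following Python sketch looks for a number-like value shortly after one of the key words. The key word list, the value pattern and the function name are assumptions made for this example only; they are not part of KTM or of any other product.

```python
import re

# Hypothetical key words and value pattern, chosen only for this illustration.
KEYWORDS = r"(?:insurance\s+number|ins\.?\s*nbr\.?)"
VALUE = r"([0-9]{2}-[0-9]{6}-[0-9]{2})"

# Key word, then at most 20 non-digit characters, then the value.
KEYWORD_THEN_VALUE = re.compile(KEYWORDS + r"\D{0,20}?" + VALUE, re.IGNORECASE)


def find_value_near_keyword(text):
    """Return the first value that follows a key word, or None."""
    match = KEYWORD_THEN_VALUE.search(text)
    return match.group(1) if match else None


print(find_value_near_keyword("Your Ins Nbr: 14-386723-89, valid from 2012"))
```

The limit of 20 characters between key word and value is arbitrary; in practice it would be tuned to the layout of the documents at hand.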
Most established classification and extraction products offer tools of this kind. With machine-printed text, all of them deliver sufficiently good results.
At our customers we use the Kofax product Kofax Transformation Modules (KTM) for document classification and data extraction. The KTM tools for free-form recognition of machine-printed text are the so-called ‘format locators’. You can read about them in earlier KTM blog articles (1).
This article describes how to find handwritten numbers with a known structure anywhere on a document.
In this example we search for handwritten insurance numbers on a document. These numbers have the following structure: 1x-xxxxxx-xx, where each x represents a digit between 0 and 9. Example: 14-386723-89.
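The structure can be written down as a regular expression. The following Python sketch is only meant to make the format tangible outside of KTM; the function name is made up for this example, and the pattern takes the structure above literally (a leading 1, then digits and hyphens in the positions shown).

```python
import re

# Structure from the text: "1", one digit, "-", six digits, "-", two digits.
INSURANCE_NUMBER = re.compile(r"1[0-9]-[0-9]{6}-[0-9]{2}")


def is_insurance_number(candidate):
    """True if the whole string has the form 1x-xxxxxx-xx."""
    return INSURANCE_NUMBER.fullmatch(candidate) is not None


print(is_insurance_number("14-386723-89"))  # True: the example from the text
print(is_insurance_number("24-386723-89"))  # False: does not start with 1
```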
This is the example document, which will be used in our KTM project:
Within the KTM project you first have to assign the example document to the appropriate document class (in our example the class ‘InsuranceDocs’). This can be done with any of the available KTM classification methods (see also: Document classification with KTM).
A field ‘InsuranceNumber’ and a locator ‘Numbers’ (Advanced Zone Locator) should be added to the document class ‘InsuranceDocs’:
This is the basic idea behind ‘free-form recognition’ for handwritten numbers:
- The Advanced Zone Locator reads the text of the page by sizing its zone large enough to cover the entire page (or at least the region where the handwritten numbers may occur).
- In our experience the RecoStar engine reads numerical characters better than the FineReader engine. RecoStar is therefore used within the Advanced Zone Locator with a numerical recognition profile [0-9-].
- The result of the Advanced Zone Locator is a string consisting of digits and hyphens (-).
- Within the script of the document class ‘InsuranceDocs’ the result string is searched for insurance numbers using regular expressions.
- If possible, the insurance number found should be checked against an inventory database and finally written to the extraction field ‘InsuranceNumber’.
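The last three steps can be tried out independently of KTM. The Python sketch below assumes a made-up zone result and uses a plain set in place of the inventory database; the function name and the sample data are hypothetical, and the pattern again follows the structure 1x-xxxxxx-xx while tolerating whitespace around the hyphens.

```python
import re

# Tolerant pattern: the recognition may insert whitespace around the hyphens.
CANDIDATE = re.compile(r"1[0-9]\s?-\s?[0-9]{6}\s?-\s?[0-9]{2}")


def extract_insurance_number(zone_text, known_numbers):
    """Return the first candidate that is contained in known_numbers, or None."""
    for match in CANDIDATE.finditer(zone_text):
        number = re.sub(r"\s", "", match.group(0))  # drop any whitespace
        if number in known_numbers:
            return number
    return None


# Made-up zone result: digits, hyphens and line breaks only.
zone_text = "7-31\n2012-04\n14 -386723- 89\n-1-\n55"
print(extract_insurance_number(zone_text, {"14-386723-89"}))
```

Checking every candidate against the inventory, rather than only the first one, is a design choice of this sketch: it helps when unrelated digits on the page happen to form a string of the right shape.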
Setup of the Advanced Zone Locator
Draw the zone over the region of the example page where the handwritten numbers may occur:
Set the zone recognition profile to a RecoStar zone engine with these settings:
Clear the checkbox ‘Registration failure makes zone invalid’, because registration will always fail with unstructured documents and we want to keep the result in any case:
Testing of the Advanced Zone Locator will show this result:
At first this looks somewhat messy, but the desired insurance number shows up in the fourth line from the bottom. This number still has to be extracted from the result string.
Extraction of the insurance number by scripting
In this example we use the event ‘Document_AfterProcess’ in the script of the document class ‘InsuranceDocs’ to extract the insurance number from the result string of the Advanced Zone Locator using regular expressions.
First, the library ‘Microsoft VBScript Regular Expressions 5.5’ has to be added as a reference to the script:
This Microsoft library enables the script to search string variables with regular expressions (Microsoft VBScript Regular Expressions 5.5 Description).
The complete KTM script looks like this:
```vb
Option Explicit

' Class script: InsuranceDocs

Private Sub Document_AfterProcess(ByVal pXDoc As CASCADELib.CscXDocument)
   Dim String_RecoStar As String
   Dim myRegExp As RegExp
   Dim myMatches As MatchCollection
   Dim InsNbr_Recostar As String

   'nothing to do if the advanced zone locator returned no result
   If pXDoc.Locators.ItemByName("Numbers").Alternatives.Count = 0 Then Exit Sub

   'get the first alternative from the advanced zone locator
   String_RecoStar = Trim(pXDoc.Locators.ItemByName("Numbers").Alternatives(0).SubFields.ItemByName("UF_Zone0").Text)

   Set myRegExp = New RegExp
   myRegExp.IgnoreCase = True
   myRegExp.Global = True
   'define the regular expression for the insurance numbers
   '(optional whitespace around the hyphens; the second digit is limited to 1-9 here)
   myRegExp.Pattern = "1(1|2|3|4|5|6|7|8|9)\s?\-\s?\d{6}\s?\-\s?\d{2}"

   Set myMatches = myRegExp.Execute(String_RecoStar)
   If myMatches.Count > 0 Then 'if something was found:
      'we just take the first result in this example...
      InsNbr_Recostar = Replace(myMatches.Item(0).Value, " ", "") 'get rid of spaces
      If DB_Check(InsNbr_Recostar) = True Then 'if possible validate the number against a database
         'put the value into the InsuranceNumber field
         pXDoc.Fields.ItemByName("InsuranceNumber").Text = InsNbr_Recostar
         pXDoc.Fields.ItemByName("InsuranceNumber").Valid = True
      End If
   End If
End Sub

Function DB_Check(Number As String) As Boolean
   'Implement the database validation of the extracted insurance number here
   DB_Check = True 'just return True in this example
End Function
```
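The function DB_Check is only a placeholder in the script. What such a lookup might do can be sketched in Python with an in-memory SQLite database; the table name, the column name and the sample entry are invented for this illustration, and a real project would query the customer's inventory system from within the KTM script instead.

```python
import sqlite3


def db_check(connection, number):
    """Return True if the number exists in the (hypothetical) inventory table."""
    row = connection.execute(
        "SELECT 1 FROM insurance_numbers WHERE number = ? LIMIT 1", (number,)
    ).fetchone()
    return row is not None


# In-memory database with one made-up entry, standing in for the real inventory.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE insurance_numbers (number TEXT PRIMARY KEY)")
connection.execute("INSERT INTO insurance_numbers VALUES (?)", ("14-386723-89",))

print(db_check(connection, "14-386723-89"))  # True: number is in the inventory
print(db_check(connection, "14-386723-88"))  # False: unknown number
```

Passing the number as a query parameter instead of concatenating it into the SQL string keeps the lookup safe even if the recognition result contains unexpected characters.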
Processing the example document in KTM Project Builder finally produces a result like this:
(1) More codecentric blog articles about KTM:
KTM and insurance companies: Document Process Automation
Document classification with Kofax Transformation Modules (KTM)
Kofax Transformation Modules – format locators and dynamic regular expressions – Part 2
Kofax Transformation Modules – format locators and dynamic regular expressions
Blog author
Jürgen Voss