Document classification and data extraction in companies have to deal with paper documents, emails and faxes. The orientation of the digitized documents (0°, 90°, 180°, 270°) usually doesn't matter: during OCR processing, the system recognizes the rotation of each document and aligns it for readability.
Sometimes this alignment mechanism fails. Faxes in particular include a fax header line, which is often rotated by 180° relative to the body text. This happens, for example, when the sheet is put into the fax machine the wrong way round. But even paper documents may contain notes or numbers printed to the left or right of the text, rotated by 90°. This blog post explains how to solve this problem with a small Kofax Transformation Modules (KTM) script.
Kofax Transformation Modules
With faxes in particular, the OCR engine reads the fax header line first, so the wrong orientation of the document is not corrected.
Our customers often use Kofax Transformation Modules (KTM) for automated mailroom processing. The failed rotation alignment also occurs with KTM. Kofax offers a solution in Kofax Knowledgebase article 19794.
The Kofax Transformation Modules script described there erases rectangular regions along the margins (left, right, top, bottom), so that the OCR engine no longer finds the text inside these regions. The erasure happens only in memory; the source document remains unchanged. The width of these regions (100 in the example below) must be adjusted to the customer's documents.
Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lMargin As Long

   lMargin = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'Erase a margin around the edge of the image
   oImage.EraseRect 0, 0, lMargin, oImage.Height
   oImage.EraseRect oImage.Width - lMargin, 0, lMargin, oImage.Height
   oImage.EraseRect 0, 0, oImage.Width, lMargin
   oImage.EraseRect 0, oImage.Height - lMargin, oImage.Width, lMargin

   'Clean up memory
   Set oImage = Nothing
End Sub
To find the best width for the regions, I saved the modified document (which exists only in memory) as a TIF file in a temporary directory on the hard disk. With an image viewer I could then examine the result of the deletions (see script below).
oImage.EraseRect deletes the regions by whiting them out, which makes the deleted regions hard to identify on a document with a white background. Instead, you can use oImage.Redact, which marks the regions in black. This makes checking the region sizes much easier.
Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lMargin As Long

   lMargin = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'Black out a margin around the edge of the image
   oImage.Redact 0, 0, lMargin, oImage.Height
   oImage.Redact oImage.Width - lMargin, 0, lMargin, oImage.Height
   oImage.Redact 0, 0, oImage.Width, lMargin
   oImage.Redact 0, oImage.Height - lMargin, oImage.Width, lMargin

   'Save image to temp for inspection (deactivate this line in the runtime environment)
   oImage.Save "C:\temp\Redact.tif", CscImgFileFormatTIFFFaxG4

   'Clean up memory
   Set oImage = Nothing
End Sub
The source document:
The modified in-memory document:
The result of the Kofax Transformation Modules script, the correct rotation of the document, cannot be tested in KTM Project Builder, although it works in the runtime environment. You can, however, save the redacted image to a TIF file with the oImage.Save call mentioned above and inspect the result. The oImage.Save line should only be active within Project Builder. Deactivate it for runtime; otherwise the TIF file will also be written in the runtime environment.
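One way to avoid forgetting the oImage.Save line is to put it behind a switch. The following is only a sketch: the constant SAVE_DEBUG_IMAGE is a hypothetical name introduced here, not part of KTM, and for brevity only the top margin is redacted. All image calls are the ones already used in the scripts above.

```vb
'Hypothetical switch (not part of KTM): set to True only while testing in Project Builder
Private Const SAVE_DEBUG_IMAGE As Boolean = False

Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lMargin As Long

   lMargin = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'Black out the top margin only (shortened example)
   oImage.Redact 0, 0, oImage.Width, lMargin

   'Write the in-memory image to disk only when the switch is on
   If SAVE_DEBUG_IMAGE Then
      oImage.Save "C:\temp\Redact.tif", CscImgFileFormatTIFFFaxG4
   End If

   'Clean up memory
   Set oImage = Nothing
End Sub
```

With the switch set to False, the script behaves as if the save line were commented out, so the same code can be deployed to the runtime environment.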
Summary
By using this simple redaction functionality, we were able to align all faxes correctly and get correct OCR results for data extraction in our customer projects. With faxes, it is sufficient to redact only the upper margin, which contains the fax header line.
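A fax-only variant could therefore look like the following sketch. It reuses the top-margin call from the full script; the variable name lHeaderHeight and its value of 100 are assumptions and need to be adjusted to the height of the fax header line in your documents.

```vb
Private Sub Document_BeforeClassifyXDoc(ByVal pXDoc As CASCADELib.CscXDocument, ByRef bSkip As Boolean)
   Dim oImage As CscImage
   Dim lHeaderHeight As Long

   'Assumed value: adjust to the height of the fax header line
   lHeaderHeight = 100

   'Get current image for page 1
   Set oImage = pXDoc.CDoc.Pages(0).GetImage()

   'White out only the strip along the top edge of the image
   oImage.EraseRect 0, 0, oImage.Width, lHeaderHeight

   'Clean up memory
   Set oImage = Nothing
End Sub
```

Leaving the left, right and bottom margins untouched reduces the risk of erasing body text that runs close to the edge of the page.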
Blog author
Jürgen Voss