Document Processing Part II: Request Driven OCR

30 Apr 2005
To get qualified access to paper-based information, plain OCR is sometimes not enough. This article shows why, and offers a way to increase OCR quality through semi-automatic table extraction.

Sample Image - tableextractor.gif

Introduction

Document processing has been used for decades in the financial and insurance industries. In this second part of the overview, the subject is Request Driven Extraction (RDE) as the next step beyond plain OCR analysis.

OCR is a powerful and popular technique for reading paper-based documents. Today's OCR systems are no longer restricted to reading running text. They also recognize higher-level layout structures such as lists and tables. So why do we need extra table extraction?

The Problem

Here is an example: Let's say you want to export table data to an Excel file. In our example, you have three paper documents with the very same table layout (your phone bills, for example). If you are not willing to type in every character manually, you can scan the documents and perform an OCR analysis. But since OCR cannot 'know' that all documents contain the same table layout, you may (in the worst case) get three different table formats.

Tables with plain OCR

A Solution

That is where RDE comes in. With RDE, you define a single table pattern for all documents and let the machine produce results that match it. After the process, you have one single information model for every document - a small but very important difference.

The TableExtractor - A User's Manual

To show RDE technology in a simple way, I created the TableExtractor. This application can be seen as an expanded version of the MODI (Microsoft Office Document Imaging) example from Document Processing Part I. Again, MS Office 2003 is required. The new feature is the 'Table Capture Frame', a semi-transparent tool window in which you customize your own table requests. The following steps guide you through the whole process of table extraction:

  1. Open an image document.
  2. Press the OCR button to get plain document text.
  3. Select a table you want to extract by using the red selection area.

  4. Press the Adjust button. This shows the Table Capture Frame. By default, the table request contains a single column.
  5. To customize your table request, choose Add Columns and resize the columns by dragging their headers.
  6. Press Capture to extract the table.
  7. Export the table result to a file.
  8. Open a new document. You can, of course, reuse the table request you have already customized.
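
Steps 7 and 8 are what make the results uniform: every document is captured with the same table request, so all rows share one column layout and can go into one file. The following is a minimal sketch in Python, not the TableExtractor's C# source. The function name `export_tables`, the CSV target format, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, List, TextIO


def export_tables(out: TextIO, header: List[str],
                  tables: Iterable[List[List[str]]]) -> int:
    """Write the rows of several captured tables under one shared header.

    `tables` holds one list of rows per document. Because every document
    was captured with the same table request, each row must have exactly
    as many cells as the header. Returns the number of data rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    count = 0
    for rows in tables:
        for row in rows:
            if len(row) != len(header):
                raise ValueError("row does not match the table request")
            writer.writerow(row)
            count += 1
    return count


# Hypothetical rows captured from two phone bills with the same request.
bill_march = [["12.03.", "555 0100", "0.42"]]
bill_april = [["02.04.", "555 0199", "1.10"]]

buffer = io.StringIO()
written = export_tables(buffer, ["Date", "Number", "Cost"],
                        [bill_march, bill_april])
```

A CSV file like this opens directly in Excel, which covers the export scenario from the problem description.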

The TableExtractor – Technical Aspects

The implementation contains no special tricks and introduces no groundbreaking design patterns. Instead, I want to draw your attention to the underlying object model.

The Document Model

For the application, we design a simple document model: a hierarchy of four layout element classes (documents, pages, lines, and words). We do not use the MODI object model this time, because we need line elements, which the MODI model does not provide. After the OCR process is done, we generate an instance of our model by converting the MODI objects. At this point we do not generate lines yet. Lines are produced later by the cluster algorithm, which groups words into lines only within the selected table area.
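
To make the hierarchy concrete, here is a minimal sketch in Python rather than the article's C#. All class and field names are assumptions, as is the input format of `document_from_ocr` (plain tuples standing in for the converted MODI objects). Keeping a flat word list on the page until lines are clustered is also one possible reading of the text, not a confirmed detail of the TableExtractor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """A recognized word with its bounding box in page coordinates."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Line:
    """A text line; created later by clustering, not by the OCR conversion."""
    words: List[Word] = field(default_factory=list)


@dataclass
class Page:
    """Holds the flat OCR word list plus any lines clustered afterwards."""
    words: List[Word] = field(default_factory=list)
    lines: List[Line] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


def document_from_ocr(ocr_pages) -> Document:
    """Convert OCR output into the model.

    `ocr_pages` is assumed to be a list of pages, each a list of
    (text, left, top, right, bottom) tuples. Lines stay empty on purpose.
    """
    doc = Document()
    for ocr_words in ocr_pages:
        doc.pages.append(Page(words=[Word(*w) for w in ocr_words]))
    return doc


doc = document_from_ocr([[("Date", 10, 10, 50, 22),
                          ("Amount", 200, 11, 260, 23)]])
```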

The Request Model

To represent our knowledge about the table, we create a table request. This table request contains column requests. Each column request stores its width relative to the width of the table.
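
Because the widths are relative, the same request can be applied to a table selection of any size. The sketch below is in Python, not the article's C#. The class names, the `column_bounds` helper, and the choice to normalize widths by their sum are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ColumnRequest:
    name: str
    relative_width: float  # share of the total table width


@dataclass
class TableRequest:
    columns: List[ColumnRequest] = field(default_factory=list)

    def column_bounds(self, table_left: float,
                      table_right: float) -> List[Tuple[float, float]]:
        """Turn relative widths into absolute (left, right) column edges.

        Widths are normalized by their sum, so they need not add up to 1.
        """
        total = sum(c.relative_width for c in self.columns)
        table_width = table_right - table_left
        bounds = []
        x = table_left
        for column in self.columns:
            width = table_width * column.relative_width / total
            bounds.append((x, x + width))
            x += width
        return bounds


request = TableRequest([ColumnRequest("Date", 0.25),
                        ColumnRequest("Number", 0.5),
                        ColumnRequest("Cost", 0.25)])

# The same request applied to two selections of different size.
small = request.column_bounds(100, 500)
large = request.column_bounds(0, 800)
```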

Table Extraction Process

The extraction process is implemented in two simple steps. In the first step, the table's lines are clustered from the selected word elements. This clustering performs a so-called Hough transform: wherever words have overlapping projections on the Y-axis, they are combined into a line element. That is the reason why we do not do global line segmentation. Because of noise elements (such as OCR errors), Hough transforms work better when restricted to small areas. The second step iterates through all generated lines and assigns the words of each line to columns. This is done by using simple intersection criteria.
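
Both steps can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the TableExtractor's C# implementation. It groups words greedily by overlapping vertical extents, and it resolves the "intersection criteria" by assigning each word to the column it overlaps most. Both choices, like all names used here, are assumptions.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    left: float
    top: float
    right: float
    bottom: float


def cluster_lines(words: List[Word]) -> List[List[Word]]:
    """Step 1: group words whose Y-axis projections overlap into lines."""
    lines = []  # each entry is [top, bottom, words_of_line]
    for word in sorted(words, key=lambda w: w.top):
        for line in lines:
            if word.top < line[1] and word.bottom > line[0]:
                # Overlap found: grow the line's extent and add the word.
                line[0] = min(line[0], word.top)
                line[1] = max(line[1], word.bottom)
                line[2].append(word)
                break
        else:
            lines.append([word.top, word.bottom, [word]])
    # Lines come out top to bottom; order each line's words left to right.
    return [sorted(line[2], key=lambda w: w.left) for line in lines]


def split_columns(line: List[Word],
                  bounds: List[Tuple[float, float]]) -> List[str]:
    """Step 2: put each word into the column it overlaps most."""
    cells = [[] for _ in bounds]
    for word in line:
        overlaps = [min(word.right, right) - max(word.left, left)
                    for left, right in bounds]
        best = max(range(len(bounds)), key=lambda i: overlaps[i])
        if overlaps[best] > 0:  # words outside every column are dropped
            cells[best].append(word.text)
    return [" ".join(cell) for cell in cells]


def extract_table(words, bounds):
    return [split_columns(line, bounds) for line in cluster_lines(words)]


# Hypothetical OCR words from a selected table area, in arbitrary order.
words = [
    Word("York", 155, 121, 190, 133), Word("0.42", 320, 99, 350, 111),
    Word("13.03.", 10, 120, 60, 132), Word("Berlin", 120, 101, 170, 113),
    Word("12.03.", 10, 100, 60, 112), Word("New", 120, 121, 150, 133),
    Word("1.10", 320, 119, 350, 131),
]
table = extract_table(words, [(0, 100), (100, 300), (300, 400)])
```

Note that the words of one line do not need identical top coordinates. A small vertical jitter, which is typical for OCR bounding boxes, is absorbed by the overlap test.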

Restrictions

This article describes a very basic table model. There are plenty of ways to extend it, and a few examples are listed below. Be aware that some of these points can become very complex and that their development currently keeps a lot of people busy.

  • Multiple line requests: In our example request model, only one type of line is defined. You could allow alternative line types within a single table.
  • AutoCorrection: You could add data format templates (e.g. regular expressions) to the column request model. That would enable you to detect or correct OCR errors in the text content.
  • Column Order: You could extend the model to support different column orders and optional columns.
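
As an illustration of the AutoCorrection idea, a column request could carry a regular expression that each cell must match. The Python sketch below is a hypothetical extension, not part of the TableExtractor. The function `check_cell` and the confusion table of commonly misread characters are assumptions.

```python
import re
from typing import Optional

# Assumed table of letters that OCR often confuses with digits.
OCR_CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1",
                                "I": "1", "S": "5", "B": "8"})


def check_cell(text: str, pattern: str) -> Optional[str]:
    """Validate a cell against its column's format template.

    Returns the text if it matches, a corrected text if replacing
    commonly confused characters makes it match, and None otherwise
    (the error is detected but cannot be corrected automatically).
    """
    if re.fullmatch(pattern, text):
        return text
    candidate = text.translate(OCR_CONFUSIONS)
    if re.fullmatch(pattern, candidate):
        return candidate
    return None


AMOUNT = r"\d+\.\d{2}"  # hypothetical template for a cost column

clean = check_cell("12.50", AMOUNT)
fixed = check_cell("l2.5O", AMOUNT)
broken = check_cell("abc", AMOUNT)
```

Such a blind substitution is only safe for columns with a strict format, like amounts or dates. In a free-text column it would corrupt valid words.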

History

  • 2005-04-25

    Initial version.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Written By
CEO, Axonic Informationssysteme GmbH, Germany
