
How To: Use Office 2007 OCR Using C#

24 Aug 2009 | CPOL | 3 min read
Reading text from any image using Microsoft Office 2007 OCR

Introduction

The sample application scans a specified directory for images and reads the text from each one it finds. It automatically saves the text from each image in a text file with the same name as the image, and it catches the exceptions raised by images that cannot be processed.

If you have Office 2007 installed, its OCR component is available for you to use, and Office is the only dependency it adds to your code. Requiring Office (2007 or 2003) on the target machine may or may not suit your situation, but if your client can guarantee that every machine running your code has Office installed, this solution is ideal for you.

What is OCR?

OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. This involves photoscanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.

Put another way, OCR translates images of text, such as scanned documents, into actual text characters. Also known as text recognition, OCR makes it possible to edit and reuse text that is normally locked inside scanned images. It works by using a form of artificial intelligence known as pattern recognition to identify the individual text characters on a page, including punctuation marks, spaces, and ends of lines.

What is Document Imaging?

Document imaging is the process of scanning paper documents and converting them to digital images that are then stored on CD, DVD, or other storage media. With Microsoft Office Document Imaging, you can scan paper documents and save the resulting images to your computer's hard disk, a network server, a CD, or a DVD in either of two formats:

  • Tagged Image File Format (TIFF): A high-resolution, tag-based graphics format. TIFF is used for the universal interchange of digital graphics.
  • Microsoft Document Imaging Format (MDI): A high-resolution, tag-based graphics format, based on the Tagged Image File Format (TIFF) used for digital graphics.

Microsoft Office Document Imaging also lets you perform Optical Character Recognition (OCR), either as part of scanning a document or while you work with a TIFF or MDI file. After performing OCR, you can copy the recognized text from a scanned image or a fax into a Microsoft Word document or another Office program file.

Weakness

An application that uses this OCR approach runs only on machines where the Office OCR component is installed. Without that component, the application will not work.
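
Because of this dependency, it can help to check for the component before attempting any OCR. The following is a minimal sketch, not part of the sample application: it assumes the component registers the COM ProgID "MODI.Document", and the class and method names are hypothetical.

```csharp
using System;

public static class OcrComponentCheck
{
    // Assumption: the Office Document Imaging component registers this ProgID.
    public const string ModiProgId = "MODI.Document";

    // Returns true when a COM class is registered under the given ProgID.
    public static bool IsRegistered(string progId)
    {
        try
        {
            // This overload returns null (rather than throwing) when the
            // ProgID is not found in the registry.
            return Type.GetTypeFromProgID(progId) != null;
        }
        catch (PlatformNotSupportedException)
        {
            // COM lookups are not supported outside Windows.
            return false;
        }
    }
}
```

An application could call OcrComponentCheck.IsRegistered(OcrComponentCheck.ModiProgId) at startup and show a clear message when it returns false. Note that the lookup reflects the registry view of the running process, so a 64-bit process may not see a component that was registered by a 32-bit Office installation.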

Strength

The component ships with Office, so you can use it in your code at no extra cost. It is also easy to use, because Microsoft provides many code samples that show how to work with it.

Namespaces

C#
using System.Collections;
using System.IO;
using System.Drawing.Imaging;

Using the Code

The name of the COM object that you need to add as a reference is Microsoft Office Document Imaging 12.0 Type Library. By default, Office 2007 doesn't install it, so you'll need to add it with the Office 2007 installation program. Run the installer, select "Add or Remove Features", click the Continue button, and ensure that the imaging component is installed.

The OCR engine always defaults to the user's regional settings for the LangID argument, unless you specify the language explicitly when calling the OCR method; it does not retain the previously used setting. In a mixed-language environment, it is a good practice to specify the LangID argument explicitly in every call to the OCR method.
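
A minimal sketch of passing the language explicitly on every call is shown here. It assumes the project references the Microsoft Office Document Imaging 12.0 Type Library; the class and method names are hypothetical, while MODI.Document, OCR, and MODI.MiLANGUAGES are the same members used in this article's sample.

```csharp
using System;

public static class OcrLanguageSketch
{
    // Hypothetical helper: runs OCR on one image with an explicitly chosen
    // language instead of relying on the user's regional settings.
    public static string RecognizeText(string imagePath, MODI.MiLANGUAGES language)
    {
        MODI.Document document = new MODI.Document();
        document.Create(imagePath);

        // The language is supplied by the caller on every call; the two
        // boolean arguments are the same values used in the article's sample.
        document.OCR(language, true, true);

        // Return the recognized text of the first image in the document.
        MODI.Image image = (MODI.Image)document.Images[0];
        return image.Layout.Text;
    }
}
```

A caller would then write, for example, OcrLanguageSketch.RecognizeText(path, MODI.MiLANGUAGES.miLANG_ENGLISH), where path is the full path of an image file, so the language never depends on the machine the code happens to run on.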

To begin, create a Windows Application using C#. In the Visual Studio Solution Explorer, right-click References, choose Add Reference, select the COM tab, and then select Microsoft Office Document Imaging 12.0 Type Library.

C#
/// <summary>
/// Checks a directory for JPG images, reads the text from each image,
/// and automatically saves that text in a text file named after the image.
/// Images that cause errors are skipped.
/// </summary>
/// <param name="directoryPath">Path of the directory to check for images</param>
public void CheckFileType(string directoryPath)
{
    // StringComparison and Exception below also require "using System;"
    foreach (string filePath in Directory.GetFiles(directoryPath))
    {
        // Get the file extension, including the leading dot
        string fileExtension = Path.GetExtension(filePath);

        // Build the text file path by swapping only the final extension
        string textFilePath = Path.ChangeExtension(filePath, ".txt");

        // Check for the JPG file format, ignoring case
        if (string.Equals(fileExtension, ".jpg", StringComparison.OrdinalIgnoreCase))
        {
            try
            {
                // OCR operations
                MODI.Document md = new MODI.Document();
                md.Create(filePath);
                md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
                MODI.Image image = (MODI.Image)md.Images[0];

                // Create a text file with the same name as the image file;
                // FileMode.CreateNew throws if that file already exists
                using (FileStream createFile =
                    new FileStream(textFilePath, FileMode.CreateNew))
                using (StreamWriter writeFile = new StreamWriter(createFile))
                {
                    // Save the image text in the text file
                    writeFile.Write(image.Layout.Text);
                }
            }
            catch (Exception exc)
            {
                // Uncomment the code below to see the error message
                //MessageBox.Show(exc.Message,
                //"OCR Exception",
                //MessageBoxButtons.OK, MessageBoxIcon.Information);
            }
        }
    }
}
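
The extension check and the output path used in the listing can also be isolated into two small helpers, which makes them easy to exercise on a machine without Office. This is a minimal sketch; the class and method names are hypothetical, and the list of extensions to accept is left to the caller.

```csharp
using System;
using System.IO;

public static class OcrFileHelper
{
    // Returns true when the path ends with one of the given extensions
    // (each including the leading dot), compared without regard to case.
    public static bool HasExtension(string path, params string[] extensions)
    {
        string actual = Path.GetExtension(path);
        foreach (string extension in extensions)
        {
            if (string.Equals(actual, extension, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }

    // Returns the path of the text file that belongs to an image. Only the
    // final extension is replaced, so a ".jpg" that appears elsewhere in the
    // path (for example in a folder name) is left untouched.
    public static string GetTextFilePath(string imagePath)
    {
        return Path.ChangeExtension(imagePath, ".txt");
    }
}
```

For example, OcrFileHelper.HasExtension("scan.JPG", ".jpg") returns true, and OcrFileHelper.GetTextFilePath("scan.jpg") returns "scan.txt".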

Points of Interest

I have built a larger sample application for Office OCR, and I'll release it soon.

Remark

Many people use OCR in Internet spiders to extract data from images.

History

  • 24-08-2009: Released

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) Equinox Web
Egypt
I have 5 years of experience working as a Software Developer. I have a wide range of programming experience and am skilled in the use of Visual Studio .NET 2008, Windows applications, web applications, web services, Windows services, WPF, HTML, JavaScript, Ajax, ASP.NET, DevExpress controls, and Office application programmability in Visual Studio .NET 2008. I create web and Windows applications using C#, and I am experienced in using all Microsoft Office applications.

