How to make use of OCR technology through a web browser

Dynamsoft

Aug 9, 2012

CPOL

In this article, I’ll show you how to convert scanned images to searchable text and PDF files through a web browser.

Download source - 2 MB

Introduction

Dynamsoft’s Dynamic Web TWAIN SDK is a solution for web-based image processing that lets developers ignore the low-level details and focus on what is important. Dynamic Web TWAIN 8.0, released in July 2012, introduced two new add-ons: one for OCR (Optical Character Recognition) and one for Barcode Recognition. This article focuses on the OCR add-on, which lets developers use OCR without worrying about the low-level implementation while retaining flexibility.

The web version of the SDK is controlled through JavaScript, a scripting language that almost all web developers are familiar with. A broad set of functions and properties gives developers full control and makes it easy to build their own interface. All popular browsers are supported: the underlying Web TWAIN software is provided both as an ActiveX control and as a browser plugin. Users of Internet Explorer can use the ActiveX version, and users of other browsers, such as Firefox, Chrome, Safari, and Opera, can use the plugin version. For those not using the Windows operating system, a Mac plugin edition is also available.

Below, we’ll use Dynamsoft’s new Dynamic Web TWAIN OCR add-on to extract text from a scanned image. All three output modes will be demonstrated: Plain Text, Plain Text PDF, and Image over Text PDF. If the Dynamic Web TWAIN SDK is not yet installed on your system, you can download a trial, or try the online demo on the Dynamsoft website.

What Dynamic OCR Supports

Over 40 languages, including Arabic and various Asian languages.
All the common file formats: jpg, gif, png, bmp, tiff, and more.
Multiple-page document processing.
Hand-written and printed characters.
Font name and size recognition.
Detailed positioning and format information.
PDF output maintaining the look of the original document.
Integration with Dynamic Web TWAIN, so images can be edited before OCR is performed.

How to Use OCR in Your Application

The following code samples are all provided in JavaScript, under the assumption that a WebTWAIN object has already been created with the variable name WebTWAIN.

Understanding OCR Settings

Before OCR begins, a number of settings can be changed which affect the output. While sensible defaults are provided, there are many situations where it may be appropriate to change settings.

OCRDllPath: This needs to be set to the path where the OCR dll file is located. By default, this is set to the current working directory.

OCRTessDataPath: This needs to be set to the root path for the application, where the tessdata folder for the language you wish to use is located.

OCRLanguage: By default, this is set to "eng". This refers to the prefix of the language files you wish to use.

OCRResultFormat: An important setting that determines whether the OCR results are saved to a text file or a PDF file.

OCRUseDetectedFont: Determines whether or not OCR should try to use the detected font faces when outputting PDF documents. The detected fonts need to exist on the system for them to be used, so use this option with caution.

OCRUnicodeFontName: A backup font to be used in the case that OCRUseDetectedFont is off. This font must also exist on the system. If this is not provided, the library will attempt to use an appropriate font for the language used.

OCRPageSetMode: Affects the way OCR determines the page formatting. By default this is set to automatic, which is fine for most operations.
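
Because these settings are ordinary properties on the control, one way to keep them in one place is a small helper that copies a settings object onto it. The following is only a sketch: the property names come from the list above, while the helper applyOcrSettings, the stand-in object, and the example values are assumptions made for illustration and are not part of the SDK.

```javascript
// Property names taken from the settings list above.
var OCR_SETTING_NAMES = [
    "OCRDllPath", "OCRTessDataPath", "OCRLanguage", "OCRResultFormat",
    "OCRUseDetectedFont", "OCRUnicodeFontName", "OCRPageSetMode"
];

// Hypothetical helper: copies only recognised OCR settings onto the control
// and returns the names it set, leaving everything else at its default.
function applyOcrSettings(twain, settings) {
    var applied = [];
    OCR_SETTING_NAMES.forEach(function (name) {
        if (Object.prototype.hasOwnProperty.call(settings, name)) {
            twain[name] = settings[name];
            applied.push(name);
        }
    });
    return applied;
}

// Stand-in object so the sketch runs without the control; in a real page
// you would pass the WebTWAIN object instead.
var fakeTwain = { OCRLanguage: "eng" };
var appliedNames = applyOcrSettings(fakeTwain, {
    OCRLanguage: "chi_sim",
    OCRTessDataPath: "../../tesseract-chinese/", // example path, an assumption
    NotAnOcrSetting: true                         // ignored by the helper
});
```

Filtering against a fixed list of names means a misspelled or unrelated key is silently skipped rather than added to the control, and the returned list makes it easy to check what was actually applied.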

OCR On An Image in the TWAIN Buffer

In this example, we perform English OCR on an image that has already been loaded into the Dynamic Web TWAIN buffer, identified by its index. The results are saved to a text file called OCRResults.txt. Since we’re only producing text output, we don’t need to worry about font settings, and the defaults are used for all other settings, such as OCRDllPath and OCRTessDataPath.

function DoOCR(imageIndex) {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 0; // Text
    // Returns true on success, false on failure
    return WebTWAIN.OCR(imageIndex, "OCRResults.txt");
}
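
Since the OCR call reports success or failure as a boolean, callers will usually want to check it. Below is a minimal sketch of one way to do that. The wrapper name ocrImageToText, the shape of the object it returns, and the stand-in control are all assumptions for illustration; only the property names and the OCR(index, file) call come from the sample above.

```javascript
// Hypothetical wrapper around the DoOCR pattern above: runs OCR on one
// buffered image and reports failure instead of silently returning false.
function ocrImageToText(twain, imageIndex, outputFile) {
    twain.OCRLanguage = "eng";
    twain.OCRResultFormat = 0; // Text
    var ok = twain.OCR(imageIndex, outputFile);
    return ok
        ? { ok: true, file: outputFile }
        : { ok: false, error: "OCR failed for image " + imageIndex };
}

// Stand-in for the control (an assumption so the sketch runs anywhere):
// it "succeeds" only for indexes that exist in a two-image buffer.
var stubTwain = {
    OCR: function (index, file) { return index >= 0 && index < 2; }
};

var goodResult = ocrImageToText(stubTwain, 0, "OCRResults.txt");
var badResult = ocrImageToText(stubTwain, 5, "OCRResults.txt");
```

In a real page, the WebTWAIN object would be passed in place of stubTwain, and the error branch could be used to show a message to the user.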

OCR on an External Image

The OCR functions are very flexible, and also allow for images to be loaded directly from multiple files, with paths relative to the working directory. In this sample, we will also use a different setting for OCRTessDataPath, and use Simplified Chinese for the language.

function ChineseOCR() {
    WebTWAIN.OCRLanguage = "chi_sim";
    WebTWAIN.OCRResultFormat = 0; // Text
    WebTWAIN.OCRTessDataPath = "../../tesseract-chinese/"; // Should include the final /
    // Multiple files are separated by the | character
    return WebTWAIN.OCRDirectly("wendang1.tif|wendang2.tif|wendang3.tif", "ChineseOCR.txt");
}

Using Chinese for the language demonstrates the UTF output capabilities of Dynamic OCR. The resulting text is encoded in UTF-8 Unicode, and the file should be read by an editor that supports UTF-8.
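
The two string conventions in this sample, the |-separated file list and the trailing slash on the tessdata path, are easy to get wrong by hand. The sketch below shows one way to build both. The helper names joinOcrInputs and withTrailingSlash are hypothetical and not part of the SDK.

```javascript
// Hypothetical helper: builds the |-separated file list that the sample
// above passes to OCRDirectly. Rejects names containing the separator.
function joinOcrInputs(files) {
    files.forEach(function (name) {
        if (name.indexOf("|") !== -1) {
            throw new Error("File name contains '|': " + name);
        }
    });
    return files.join("|");
}

// Hypothetical helper: makes sure a tessdata path ends with a slash,
// as the comment in the sample above requires.
function withTrailingSlash(path) {
    return /[\/\\]$/.test(path) ? path : path + "/";
}

var ocrInputs = joinOcrInputs(["wendang1.tif", "wendang2.tif", "wendang3.tif"]);
var tessDataPath = withTrailingSlash("../../tesseract-chinese");
```

The resulting strings could then be assigned to OCRTessDataPath and passed as the first argument of OCRDirectly.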

OCR to a Plain-text PDF

Dynamic OCR also supports output directly to PDF files. The simplest way to do this is to output text only, which suits documents and scans that contain primarily text. The screenshots that follow the code show part of the resulting PDF output.

function OCRToPDF() {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 1; // Plain-text PDF
    return WebTWAIN.OCRDirectly("Demo_OCR1.png", "Plain.pdf");
}

Original image:

Plain-text PDF:

As can be seen, the plain-text PDF preserves the text and its positioning, but the colour, italics, and images are lost. In many cases this is an acceptable or even the desired result. When the images are important, the Image-over-Text option should be used instead.

OCR to an Image-over-Text PDF

Image-over-Text PDFs maintain the original look of the document, but add the ability to select, copy, and search text. They are ideal for scans of complex tables, books, or other documents that contain images and complicated formatting. Below is an example of the same code, except with Image-over-Text as the OCRResultFormat.

function OCRToPDF() {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 2; // Image-over-text
    return WebTWAIN.OCRDirectly("Demo_OCR1.png", "ImageOverText.pdf");
}

The above image is a screenshot of the resulting PDF, with some of the text selected. As you can see, the text selection is accurate, and the OCR results can be copied or searched just as if the PDF were a text document.
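
The three samples so far differ mainly in the numeric value assigned to OCRResultFormat. As a sketch, those values can be given names and the samples folded into one function. The values 0, 1, and 2 are taken from the samples above. The constant names, the helpers outputExtension and ocrFile, and the stand-in control are assumptions for illustration and are not SDK identifiers.

```javascript
// Values taken from the three samples above; the constant names are
// illustrative, not SDK identifiers.
var OcrResultFormat = Object.freeze({
    TEXT: 0,
    PLAIN_TEXT_PDF: 1,
    IMAGE_OVER_TEXT_PDF: 2
});

// Hypothetical helper: picks an output file extension for a format.
function outputExtension(format) {
    return format === OcrResultFormat.TEXT ? ".txt" : ".pdf";
}

// Hypothetical helper: one entry point for all three output modes.
// Returns the output file name on success, or null on failure.
function ocrFile(twain, inputFile, baseName, format) {
    twain.OCRLanguage = "eng";
    twain.OCRResultFormat = format;
    var outputFile = baseName + outputExtension(format);
    return twain.OCRDirectly(inputFile, outputFile) ? outputFile : null;
}

// Stand-in control that records each call so the sketch runs anywhere.
var recordedCalls = [];
var recordingTwain = {
    OCRDirectly: function (input, output) {
        recordedCalls.push([input, output]);
        return true;
    }
};

var pdfName = ocrFile(recordingTwain, "Demo_OCR1.png", "ImageOverText",
    OcrResultFormat.IMAGE_OVER_TEXT_PDF);
```

Named constants make the intent visible at the call site and avoid having to remember which number means which output mode.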

OCR to a String in Memory

Of course, the results of OCR can also be saved in memory, whether in the form of plain text or a PDF. In this sample, the results are saved to a plain text string. A string could also hold Base64 encoded PDF results if the OCRResultFormat was not set to 0. After the results are saved, they are written to the page with document.write.

function OCRToString(imageIndex) {
    WebTWAIN.OCRLanguage = "eng";
    WebTWAIN.OCRResultFormat = 0; // Text
    return WebTWAIN.OCREx(imageIndex);
}

var results = OCRToString(0); // UTF-8 encoded OCR results (ASCII compatible)
document.write(results);
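
When a PDF format is selected, the returned string holds Base64-encoded PDF data rather than readable text. The sketch below shows one way such a string could be turned back into bytes and sanity-checked. It assumes the string is standard Base64. The helper names base64ToBytes and looksLikePdf are hypothetical, and the sample input is a hand-made string rather than real OCR output.

```javascript
// Decodes a Base64 string to bytes. atob is available in browsers and in
// recent Node.js versions; Buffer is used as a fallback elsewhere.
function base64ToBytes(b64) {
    var binary = (typeof atob === "function")
        ? atob(b64)
        : Buffer.from(b64, "base64").toString("binary");
    var bytes = new Uint8Array(binary.length);
    for (var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return bytes;
}

// PDF files begin with the ASCII signature "%PDF".
function looksLikePdf(bytes) {
    return bytes.length >= 4 &&
        bytes[0] === 0x25 && bytes[1] === 0x50 &&
        bytes[2] === 0x44 && bytes[3] === 0x46;
}

// "JVBERi0xLjQ=" is the Base64 encoding of the ASCII text "%PDF-1.4".
var sampleBytes = base64ToBytes("JVBERi0xLjQ=");
var sampleIsPdf = looksLikePdf(sampleBytes);
```

Checking the signature before saving or uploading the bytes is a cheap way to catch the case where the result format was left at 0 and the string is plain text after all.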

Download the Sample

To try out the above mentioned features by yourself, you can go to the online demo at: Dynamic Web TWAIN OCR Barcode Online Demo

If you’d like to evaluate Dynamic Web TWAIN which includes the OCR add-on, you can download the free trial here: Dynamic Web TWAIN 30-day Free Trial.

If you have any questions, you can contact our support team at twainsupport@dynamsoft.com.