Using LEADTOOLS OCR to Enhance Google Drive Search

LEADTOOLS Support

0/5 (0 vote)

Jun 1, 2013

CPOL

3 min read

26359

295

Using LEADTOOLS OCR to Enhance Google Drive Search

Download samples - 990.7 KB

Introduction

Google Drive is a wonderful service for storing, organizing and sharing files such as documents, photos and videos. However, TIFF and other raster image file formats can get easily lost because Google Drive’s search function can only do so much. With LEADTOOLS, developers can use its OCR SDK to extract the text and then add it to the IndexableTextData for each item. After this is completed, your raster image files can be searched in a similar manner to any text-based document like DOC or PDF.

For example, I have four ordinary TIFF files uploaded into Google Drive. Each of the four files are named OCR1 through OCR4, so only having the ability to search based on the file name isn’t entirely helpful.

To the human eye, these images are nothing but text, but Google Drive only sees these images as raster data and returns nothing when I try to search for something internal to the scanned document.

What would Google be without a way to search your files? Fortunately, Google Drive doesn’t leave you hanging and uses the customizable "IndexableTextData" metadata of each document when it performs text search. In the example that follows, we show how to enable Google Drive to find these TIFF documents based on the text content without modifying the original image.

Connecting to Google Drive

The first step in this application is to enable the Google Drive API for our application to retrieve the ClientID and ClientSecret. We will need these properties later when using the Google Drive API for uploading and modifying the TIFFs. Lastly, we must download the Google Client Library to reference in our solution. For more detailed information on setting up a .NET application to interface with Google Drive, visit https://developers.google.com/drive/quickstart-cs.

In our application, we will open the User Authorization Uri in the WebBrowser control so the user can enter his Google username and password. After the user logs in, we can get the authorization code from the WebBrowser control’s title. Now that the application is logged in and authorized to access Google Drive, we can search for all of the TIFF files in the account.

FileList fileList = googleDriveHelper.GetFilesList();
 
IEnumerable<File> tiffFilesEnumerable =
fileList.Items.Where(
   file => file.MimeType == "image/tiff" 
   && file.ExplicitlyTrashed != true 
   && file.UserPermission.Role == "owner");
 
foreach (File file in tiffFilesEnumerable)
{
   UpdateIndexableTextData(file);
}

Using LEADTOOLS OCR

Finally, we can use the LEADTOOLS OCR engine to get the text for each TIFF file and all of the pages within it. After creating the IOcrEngine and IOcrDocument, the RecognizeText function will return a string value of all the text extracted from the page and then update the IndexableTextData metadata in Google Drive.

void UpdateIndexableTextData(File file)
{
   StringBuilder indexableText = new StringBuilder();
 
   // Get a .NET stream of the document
   using (System.IO.Stream stream = googleDriveHelper.GetFileAsStream(file))
   {
      // Create an instance of LEADTOOLS OCR engine
      using (IOcrEngine ocrEngine =
         OcrEngineManager.CreateEngine(OcrEngineType.Advantage, false))
      {
         // Start the engine using default parameters
         ocrEngine.Startup(null, null, null, null);
 
         // Get the number of pages in the document
         int pageCount;
         using (CodecsImageInfo imageInfo = 
            ocrEngine.RasterCodecsInstance.GetInformation(stream, true))
         {
            pageCount = imageInfo.TotalPages;
         }
 
         // Create OCR Document
         using (IOcrDocument ocrDocument = ocrEngine.DocumentManager.CreateDocument())
         {
            // For each page in the document, recognize it
            for (int page = 1; page <= pageCount; page++)
            {
               ocrDocument.Pages.AddPages(stream, page, page, null);
 
               // Google Drive specific indexable text setup
               indexableText.AppendFormat(
                  "<section attribute=\"Page{0}\">", page);
               // Add the OCR text
               indexableText.Append(ocrDocument.Pages[0].RecognizeText(null));
               indexableText.Append("</section>");
 
               // Clear the document in prepartion for next page
               ocrDocument.Pages.Clear();
            }
         }
      }
   }
 
   file.IndexableText = new File.IndexableTextData();
 
   file.IndexableText.Text = indexableText.ToString();
 
   googleDriveHelper.UpdateFileMetadata(file);
}

Now that we have processed all of the TIFF files in Google Drive, they can be searched by the text in the documents, even though they are technically raster images with no textual data.

Download the Full OCR Example

You can download the fully functional demo which includes the features discussed above. To run this example you will need the following:

LEADTOOLS V18 (free 60 day evaluation)
Visual Studio 2008 or later
Extract the attached ZIP project to the LEADTOOLS examples directory (e.g. C:\LEADTOOLS 18\Examples\)

Support

Need help getting this sample up and going? Contact our support team for free technical support! For pricing or licensing questions, you can contact our sales team (sales@leadtools.com) or call us at 704-332-5532.

About LEADTOOLS

LEAD Technologies has been the prominent provider of digital imaging tools since 1990. Its award-winning LEADTOOLS family of toolkits helps developers integrate raster, document, medical, multimedia, vector and Internet imaging into their applications quickly and easily. Using LEADTOOLS for your imaging requirements allows you to spend more time on user interface and application-specific code, expediting your development cycle and increasing your return on investment.