I am trying to create the user interface for an educational video library. The videos are hosted elsewhere, and I want to build a user-friendly site with an extensive search engine that covers only the content of the videos. At the moment I am manually tagging each video link with 20-30 keywords. However, if I can figure out how to use the PDF transcript of each video as searchable text, the tagging would be automatic and the search results should be better. I know there are many OCR websites out there, but I haven't found any personal sites with custom OCR search engines. Is this possible?
Updated 15-Oct-14 15:56pm

1 solution

OCR? If the transcripts are text-based PDFs rather than scanned images, you don't need OCR at all. Sounds like you need iTextSharp. Check out their SourceForge page and do some reading up on how to use it. Here's a simple snippet to get you started with extracting the text from a PDF file:

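using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ParsePdf(string fileName)
{
  if (!File.Exists(fileName))
    throw new FileNotFoundException("PDF file not found.", fileName);

  // PdfReader is disposable in recent iTextSharp 5.x releases.
  using (PdfReader reader = new PdfReader(fileName))
  {
    StringBuilder sb = new StringBuilder();

    // PDF page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
      // Use a fresh strategy for each page: a reused SimpleTextExtractionStrategy
      // keeps the text of earlier pages and would duplicate it in the output.
      ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
      string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
      if (!string.IsNullOrWhiteSpace(text))
      {
        // The result is already a .NET string, so no re-encoding is needed.
        // AppendLine keeps the text of consecutive pages from running together.
        sb.AppendLine(text);
      }
    }

    return sb.ToString();
  }
}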
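Once you have the transcript text, one way to replace the manual tagging is to build a small inverted index that maps each word to the videos whose transcripts contain it. The following is only a rough sketch under my own assumptions: the TranscriptIndex class, its Add and Search methods, and the video IDs are made-up names for illustration and are not part of iTextSharp. It uses only the .NET base class library and does no stemming, stop-word filtering, or ranking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of iTextSharp): maps each word to the
// set of video IDs whose transcript contains that word.
public class TranscriptIndex
{
    private readonly Dictionary<string, HashSet<string>> _index =
        new Dictionary<string, HashSet<string>>();

    // Split text into lower-case tokens made of letters and digits.
    private static IEnumerable<string> Tokenize(string text)
    {
        return Regex.Matches(text ?? string.Empty, @"[\p{L}\p{Nd}]+")
                    .Cast<Match>()
                    .Select(m => m.Value.ToLowerInvariant());
    }

    // Index one transcript, e.g. the string returned by ParsePdf.
    public void Add(string videoId, string transcriptText)
    {
        foreach (string word in Tokenize(transcriptText))
        {
            HashSet<string> ids;
            if (!_index.TryGetValue(word, out ids))
            {
                ids = new HashSet<string>();
                _index[word] = ids;
            }
            ids.Add(videoId);
        }
    }

    // Return the videos whose transcripts contain every word of the query.
    public ISet<string> Search(string query)
    {
        HashSet<string> result = null;
        foreach (string word in Tokenize(query))
        {
            HashSet<string> ids;
            if (!_index.TryGetValue(word, out ids))
                return new HashSet<string>(); // one word is missing everywhere
            if (result == null)
                result = new HashSet<string>(ids);
            else
                result.IntersectWith(ids);
        }
        return result ?? new HashSet<string>(); // empty query matches nothing
    }
}
```

You would call Add once per video with the text extracted from its transcript, then call Search with whatever the user types. For a larger library, a dedicated full-text search engine is probably a better fit than a hand-rolled index like this one.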
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


