How do I read text from a PDF with embedded fonts?

Question

Hi,
I need to extract text from a PDF that uses custom fonts, but the custom fonts do not allow copying, pasting, searching, or extracting the text in a clear, readable way with the iText library. The resulting text consists of spaces or non-human-readable characters.

The PDF properties are:
Author: User
Creator: Compart Docponent API
Producer: Compart MFFPDF I/O Filter 2013-03-09 00:51:11
CreationDate: 04/21/16 11:26:59
ModDate: 06/09/16 10:02:16
Tagged: no
Form: none
Pages: 6
Encrypted: no
Page size: 595.2 x 841.92 pts (A4) (rotated 0 degrees)
File size: 312703 bytes
Optimized: yes
PDF version: 1.4

The PDF font info, from running the pdffonts command-line tool (the output is the same for each font), is: name: [none]; type: [Type 3]; emb: [yes]; sub: [no]; uni: [yes]

So the PDF seems to have a ToUnicode map, but that is not enough, even with the following code.

How can I read the text in a clear way?

Thanks in advance.

G.G.

What I have tried:

C#

pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
}
pdftext.Text += text.ToString();
pdfReader.Close();

Posted 10-Jun-16 7:59am

guton

Updated 10-Jun-16 8:33am

Comments

Sergey Alexandrovich Kryukov 10-Jun-16 14:29pm

Just a hint: the font has nothing to do with the text. The text is a string, which is independent of the font used to render it on some medium, such as a screen or a printed page. Just forget the font.

Now, what are those unreadable characters? Look at the data, the bytes, not at how they are rendered. Do they appear in the string "currentText", or only in currentText after the weird line where you convert it to bytes and then back to a string? Why do that?

—SA
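
One way to follow the hint above and look at the data rather than at the glyphs is to print the numeric value of every character the extractor returned. The sketch below is only an illustration: `TextInspector` and `DumpCodeUnits` are hypothetical names that come from neither iText nor the original post.

```csharp
using System;
using System.Text;

public static class TextInspector
{
    // Hypothetical helper: returns every UTF-16 code unit of the string
    // in the form U+XXXX, separated by single spaces.
    public static string DumpCodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            // Four uppercase hex digits per code unit, e.g. U+0048 for 'H'.
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

Calling `TextInspector.DumpCodeUnits(currentText)` directly after `GetTextFromPage` shows what the extractor actually produced. If the values are already spaces or control characters at that point, a later conversion step presumably cannot bring the original text back.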

guton 10-Jun-16 15:17pm

Hi Sergey,
thanks for the reply
That line of code converts the byte array to the right code page for the related font and finally to a string (see the official iText code line). Note also that the PDF file does not allow copying, pasting, or searching the text.

You can try it yourself with that code on this file: https://drive.google.com/file/d/0B0f6X4SAMh2KRDJTbm4tb3E1a1U/view

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Answer 1 · 2016-06-10T08:33:00

Solution 1

One apparent problem is:

C#

currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

It makes no sense at all.
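
A possible reason, sketched here as an illustration: a .NET string is already Unicode, so encoding it to a narrower code page and converting those bytes to UTF-8 can only return the same text or lose characters. `ASCIIEncoding.Convert` is the static `Encoding.Convert` method reached through a derived class. In the sketch, `RoundTripDemo` and `RoundTrip` are hypothetical names, and `Encoding.ASCII` is passed in as a stand-in for `Encoding.Default`, whose actual value depends on the runtime and the machine.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Same shape as the conversion in the question, with the source
    // code page passed in instead of hard-coding Encoding.Default.
    public static string RoundTrip(string s, Encoding codePage)
    {
        // Characters the code page cannot represent are replaced here.
        byte[] narrow = codePage.GetBytes(s);
        // Re-encode the (possibly already damaged) bytes as UTF-8.
        byte[] utf8 = Encoding.Convert(codePage, Encoding.UTF8, narrow);
        // Decode back to a string.
        return Encoding.UTF8.GetString(utf8);
    }
}
```

With `Encoding.ASCII`, `RoundTrip("abc", Encoding.ASCII)` gives back `"abc"` unchanged, while `RoundTrip("\u03A9", Encoding.ASCII)` gives `"?"`, because the Greek capital omega is replaced during the first step. With `Encoding.UTF8` as the code page the call returns its input. In neither case does the round trip repair text that was already wrong.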

Try removing it.

—SA

Posted 10-Jun-16 8:33am

Sergey Alexandrovich Kryukov