Hi,
I need to extract text from a PDF that uses custom fonts, but the custom fonts do not allow copying/pasting, searching, or extracting the text in a clear, readable way with the iText library. The resulting text is either spaces or non-human-readable characters.

The PDF properties are:
Author: User
Creator: Compart Docponent API
Producer: Compart MFFPDF I/O Filter 2013-03-09 00:51:11
CreationDate: 04/21/16 11:26:59
ModDate: 06/09/16 10:02:16
Tagged: no
Form: none
Pages: 6
Encrypted: no
Page size: 595.2 x 841.92 pts (A4) (rotated 0 degrees)
File size: 312703 bytes
Optimized: yes
PDF version: 1.4

The font info, as reported by the pdffonts command-line tool for each font, is: name: [none]; type: [Type 3]; emb: [yes]; sub: [no]; uni: [yes]

So the PDF seems to have a ToUnicode map, but that is not enough, even with the following code.

How can I read the text in a clear way?

Thanks in advance.

G.G.

What I have tried:

C#
pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
}
pdftext.Text += text.ToString();
pdfReader.Close();
Updated 10-Jun-16 8:33am
Comments
Sergey Alexandrovich Kryukov 10-Jun-16 14:29pm    
Just a hint: a font has nothing to do with the text, which is a string, something unrelated to the font used to render some string on some media, such as screen or printer page. Just forget the font.

Now, what are those unreadable chars? Look at the data, the bytes, not at how they are rendered. Do they appear in the string "currentText", or only in currentText after the weird line where you convert it to bytes and then back to a string? Why is that line there?

—SA
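
One way to follow this hint is to dump the code points of the extracted string instead of looking at how it renders. The snippet below is only a minimal sketch: `TextInspector` and `DescribeCodePoints` are hypothetical names, not part of iText, and it uses nothing beyond the .NET base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextInspector
{
    // Hypothetical helper: returns one "U+XXXX (Category)" entry per code point,
    // so the raw data can be inspected independently of any font or rendering.
    public static List<string> DescribeCodePoints(string s)
    {
        var result = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(s, i))
            {
                // Two UTF-16 code units that form one supplementary code point.
                codePoint = char.ConvertToUtf32(s[i], s[i + 1]);
            }
            else
            {
                // A BMP character (or a lone surrogate, reported as-is).
                codePoint = s[i];
            }

            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, i);
            result.Add($"U+{codePoint:X4} ({category})");

            if (codePoint > 0xFFFF)
            {
                i++; // skip the low surrogate that was already consumed
            }
        }
        return result;
    }
}
```

Calling `TextInspector.DescribeCodePoints(currentText)` right after `GetTextFromPage`, and again after the conversion line, would show whether the unreadable characters (for example only spaces, control characters, or private-use code points) already come out of the extraction or are introduced by the conversion.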
guton 10-Jun-16 15:17pm    
Hi Sergey,
thanks for the reply.
That line of code converts the byte array to the right code page for the related font and then to a string (see the official iText code line). Note also that the PDF file does not allow copying/pasting or searching the text.

You can try it yourself with that code on this file: https://drive.google.com/file/d/0B0f6X4SAMh2KRDJTbm4tb3E1a1U/view

1 solution

One apparent problem is:
C#
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

It makes no sense at all.

Try to throw it out.

—SA
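
To illustrate why that line cannot help: a .NET string is already Unicode (UTF-16), so encoding it to a legacy code page and converting the bytes back can at best return the same string and at worst lose characters. The sketch below reproduces the same three calls with an explicit encoding instead of `Encoding.Default`, so the behaviour does not depend on the machine; `EncodingRoundTrip` and `RoundTrip` are hypothetical names, and ISO-8859-1 merely stands in for whatever ANSI code page `Encoding.Default` resolves to on .NET Framework.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Same sequence as the line in the question, with the legacy encoding
    // passed in explicitly: string -> legacy bytes -> UTF-8 bytes -> string.
    public static string RoundTrip(string text, Encoding legacy)
    {
        // Characters the legacy code page cannot represent are replaced here.
        byte[] legacyBytes = legacy.GetBytes(text);

        // Re-encoding to UTF-8 cannot bring the replaced characters back.
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);

        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

With `Encoding.GetEncoding("ISO-8859-1")`, a plain ASCII string such as "abc" comes back unchanged, while a string containing characters outside that code page (for example "\u20AC\u65E5") does not survive the round trip. In neither case does the text get any better than what `GetTextFromPage` returned, so `currentText` can be appended to the `StringBuilder` directly.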
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


