I am trying to extract Gujarati text from PDFs using iTextSharp, but the extracted text comes out wrong:
શિવપાર્ક as િશવપાક
વિધ્યાલય as િવ ાલય
પુરુષ as ુ ુષ

My code (C#):
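using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader("pdf1.pdf");
int intPageNum = reader.NumberOfPages;
string text = ""; // must be initialized before += can be used

// iTextSharp page numbers are 1-based
for (int i = 1; i <= intPageNum; i++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
}
reader.Close();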
Updated 3-Jan-16 20:06pm
Comments
Kornfeld Eliyahu Peter 4-Jan-16 2:14am    
As most of us can't read Gujarati, you should explain what is wrong with that extract...
pareshpbhayani 4-Jan-16 2:30am    
Thank you, Kornfeld Eliyahu Peter, for the reply.
In Gujarati, શિ is built from two characters: શ (sh) + ઇ (i) in its dependent vowel-sign form િ = શિ. When the PDF contains this kind of combination, I get િશ instead.
If the text contains only શ (sh), it works perfectly.
Kornfeld Eliyahu Peter 4-Jan-16 3:07am    
That dotted circle in the middle of the letters means that your font does not contain the glyph behind the value.
Try checking with a different font package...
pareshpbhayani 4-Jan-16 7:40am    
Yes, you are right, but I have not found a solution for that. I investigated and found that શ (sh) + િ (i) = શિ takes 4 bytes in UTF-16 (little-endian): શ = [182, 10] (2 bytes, U+0AB6)
and િ = [191, 10] (2 bytes, U+0ABF), so the correct sequence is
[182, 10, 191, 10] = શિ, but the text I get from iTextSharp is
[191, 10, 182, 10] = િશ
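
The byte order above suggests the PDF stores the vowel sign િ (U+0ABF) in visual order, before the consonant it is drawn to the left of, whereas Unicode text expects it after the consonant. As a rough sketch only, assuming that this swap is the whole problem for િ, the extracted string could be post-processed as below. `GujaratiFix` and `ReorderVowelSignI` are hypothetical names made up for this illustration, not part of iTextSharp, and this does not recover characters that are missing entirely from the output (as in the વિધ્યાલય and પુરુષ examples).

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class GujaratiFix
{
    const char VowelSignI = '\u0ABF'; // િ, drawn left of the consonant but stored after it
    const char Virama = '\u0ACD';     // ્, joins consonants into a conjunct
    const char Nukta = '\u0ABC';      // ઼

    // Gujarati consonants occupy U+0A95 (ક) through U+0AB9 (હ).
    static bool IsConsonant(char c) => c >= '\u0A95' && c <= '\u0AB9';

    // Moves every િ that precedes a consonant cluster to just after that cluster.
    public static string ReorderVowelSignI(string s)
    {
        var sb = new StringBuilder(s.Length);
        int i = 0;
        while (i < s.Length)
        {
            if (s[i] == VowelSignI && i + 1 < s.Length && IsConsonant(s[i + 1]))
            {
                int j = i + 1; // start of the consonant cluster
                while (true)
                {
                    j++; // step past the consonant
                    if (j < s.Length && s[j] == Nukta) j++;
                    if (j + 1 < s.Length && s[j] == Virama && IsConsonant(s[j + 1]))
                        j++; // step past the virama; the next pass consumes the consonant
                    else
                        break;
                }
                sb.Append(s, i + 1, j - (i + 1)); // the cluster first
                sb.Append(VowelSignI);            // then the vowel sign
                i = j;
            }
            else
            {
                sb.Append(s[i]);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

For example, `GujaratiFix.ReorderVowelSignI("\u0ABF\u0AB6")` returns `"\u0AB6\u0ABF"`, i.e. િશ becomes શિ. Text that is already in logical order would be changed by this only where િ happens to be followed directly by a consonant, so it is worth checking the result against a few known words before relying on it.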

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


