Click here to Skip to main content
15,890,506 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
Dear friends,

i have been using pdftotext.exe to extract text from pdf. The text accuracy was good by using this. But the problem was i can't able to identify bold and italics text.
How can i identify the extracted text was bold or italic?

I had tried some other plugin like CSWTestingReflow, PDF parser etc..but for better text accuracy i was go with pdftotext.exe

Any idea would be appreciable..


sample code:

VB
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"
    If fso.FileExists(sReadPDF & "_Text.txt") = True Then
                'Read the text file
                Set adoStreamOut = New ADODB.Stream
                'adoStreamOut.Charset = "utf-8"
                adoStreamOut.Charset = "us-ascii"
                If adoStreamOut.State Then adoStreamOut.Close
                adoStreamOut.Open
                adoStreamOut.LoadFromFile Replace(sReadPDF, ".pdf", "") & "_Text.txt"
                sText = adoStreamOut.ReadText
    End If
    
 DoEvents
sText = Trim(sText)
sText = Trim(Replace(sText, Chr(12), ""))
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
sText = Trim(Replace(sText, vbCrLf, " "))
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))
sText = Trim(Replace(sText, "--", "—"))
Do
 sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = False

Thanks
jai
Posted

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900