Dear friends,
I have been using pdftotext.exe to extract text from PDF files. The text accuracy is good, but I am unable to identify bold and italic text.
How can I tell whether a piece of extracted text was bold or italic in the PDF?
I have tried some other plugins, such as CSWTestingReflow and PDF parser, but I went with pdftotext.exe because its text accuracy is better.
Any ideas would be appreciated.
Sample code:
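' Run pdftotext; with no output name given, it writes a .txt file next to the input PDF
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
If fso.FileExists(sReadPDF & "_Text.txt") Then
    Set adoStreamOut = New ADODB.Stream
    adoStreamOut.Charset = "us-ascii"
    If adoStreamOut.State Then adoStreamOut.Close
    adoStreamOut.Open
    ' Load the same file that was checked above (assumes sReadPDF holds the
    ' path without the .pdf extension, as the command line implies)
    adoStreamOut.LoadFromFile sReadPDF & "_Text.txt"
    sText = adoStreamOut.ReadText
    adoStreamOut.Close
End If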
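DoEvents
sText = Trim(sText)
' Remove form feeds (page breaks)
sText = Trim(Replace(sText, Chr(12), ""))
' Mark line breaks that end a sentence, and line ends that finish with a hyphen or "--"
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
' Join all remaining line breaks into spaces
sText = Trim(Replace(sText, vbCrLf, " "))
' Restore the sentence-ending line breaks
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
' Drop end-of-line hyphens so that split words are rejoined
sText = Trim(Replace(sText, "-|||", ""))
' Restore "--" and convert it to a dash character
sText = Trim(Replace(sText, "||||||", "--"))
sText = Trim(Replace(sText, "--", "—"))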
' Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0
Thanks
jai