Python Version: 3

Input: a PDF file containing a purchase order.
Input example: http://gem.compaq.com/gemstore/sites/downloads/SLED_PO_Template.pdf

Note: This is an empty sample purchase order; the actual format may vary. In real use the PDF may not be empty.

The desired output is each key name and its value from the PDF.

Sample Output:

PO number: its value in the PDF (and the same for the other keys)

Question: How can I extract the key names and their corresponding values from a given PDF file?

What I have tried:

I tried tabula-py, pdfminer2, pdftotext, OCR, and pdf2json.
The main challenge I am facing is relating each key to its true value.
Updated 4-Jul-18 7:15am
Comments
Richard MacCutchan 4-Jul-18 10:11am    
This is not really a programming issue. The answer depends on the structure of the PDF file and how these items can be recognised and related.

1 solution

"Dump" the PDF to a text file.

If the dumped text contains "markup" that identifies the PO# (check by running a "Find" on the text), then you can use the same "markup" to locate the PO# in other documents.

Understanding the Portable Document Format (PDF) - PrintMyFolders
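
As a rough illustration of the "dump, then find the markup" idea, here is a minimal Python 3 sketch. It assumes the pdftotext command-line tool is installed and that each value sits on the same line as its label in the dumped text, which will not hold for every PO layout. The label list KEYS and the function names are made up for the example; replace the labels with whatever a "Find" actually turns up in your own documents.

```python
import re
import subprocess

# Hypothetical labels; use the ones that really appear in your dumped text.
KEYS = ["PO Number", "Order Date", "Vendor"]


def dump_pdf_text(pdf_path):
    """Dump a PDF to text with the pdftotext CLI (assumed to be installed).

    -layout keeps the physical layout, so a label and its value tend to
    stay on the same line; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_fields(text, keys=KEYS):
    """Return {label: value} for every label found in the dumped text.

    Assumption: the value follows the label on the same line, optionally
    after ':' or '#', and a run of two or more spaces starts the next column.
    """
    fields = {}
    for key in keys:
        pattern = re.compile(
            re.escape(key) + r"[ \t]*[:#]?[ \t]*(.*)", re.IGNORECASE
        )
        match = pattern.search(text)
        if match:
            rest_of_line = match.group(1).strip()
            # Cut at the first wide gap so a neighbouring column is not included.
            fields[key] = re.split(r"\s{2,}", rest_of_line)[0]
    return fields
```

Typical use would be extract_fields(dump_pdf_text("po.pdf")), where "po.pdf" is a placeholder path. If the values in your PDFs sit below or far to the right of their labels, same-line matching like this will not be enough and you would need position information from the PDF instead.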
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


