Click here to Skip to main content
15,867,453 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
My project involves reading text from a bunch of PDF form files for which I'm using PyPDF2 open source library. There is no issue in getting the text data as follows:

Python
reader = PdfReader("data/test.pdf")
cnt = len(reader.pages)
print("reading pdf (%d pages)" % cnt)
page = reader.pages[cnt-1]
lines = page.extract_text().splitlines()
print("%d lines extracted..." % len(lines))


However, this text doesn't contain the checked statuses of the radio and checkboxes. I just get normal text (like "Yes No" for example) instead of these values.

What I have tried:

I also tried the reader.get_fields() and reader.get_form_text_fields() methods as described in their documentation but they return empty values. I also tried reading it through annotations but no "/Annots" found on the page. When I open the PDF in a notepad++ to see its meta data, this is what I get:

%PDF-1.4
%²³´µ
%Generated by ExpertPdf v9.2.2


It appears to me that these checkboxes aren't usual form fields used in PDF but appear similar to HTML elements. Is there any way to extract these fields using python?
Posted
Updated 23-Aug-22 0:22am
v2

1 solution

You should post your question at Discussions · py-pdf/PyPDF2 · GitHub[^].
 
Share this answer
 
Comments
Prahlad Yeri 23-Aug-22 4:53am    
Thanks Richard! I actually posted on their issue tracker and they're interacting, let's see how it goes.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900