Step 1: First, create an AWS account, then create an IAM user with full access to
Amazon Textract.
Step 2: Creating the IAM user gives us an aws_access_key_id and an
aws_secret_access_key. These keys are needed to connect our local Python script
to AWS.
Step 3: When the code runs, it first splits the PDF file into images using the
pdf2image library.
Step 4: Each image is then passed, one at a time, to the createTable(image) function,
which extracts the table from the image and saves it to a CSV file.
Step 5: Step 4 is repeated for the key-value pairs.
Step 6: Once all the files have been created, the Code() function is called.
Step 7: The Code() function keeps only those CSV files whose header contains Test
name, Technology, value, and units. This header is specific to Thyrocare reports; it can
be changed to suit other reports.
Step 8: The test data is then converted to JSON.
Step 9: Once the JSON is created, it is inserted into MongoDB. The pymongo library
is used to connect Python to MongoDB.
Step 10: Finally, the JSON file and the PDF file are copied into the pdf collection
folder, and the other CSV and image files are deleted.
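
Steps 7 and 8 can be sketched roughly as follows. This is only an illustration, not the Code() function from the question: the helper names (has_required_header, csv_to_records, collect_records, records_to_json) are hypothetical, and matching the header case-insensitively is an assumption.

```python
import csv
import json
from pathlib import Path

# Header columns named in Step 7; compared case-insensitively (an assumption).
REQUIRED_COLUMNS = {"test name", "technology", "value", "units"}


def has_required_header(header):
    """Return True if the CSV header row contains every required column."""
    found = {column.strip().lower() for column in header}
    return REQUIRED_COLUMNS <= found


def csv_to_records(csv_path):
    """Read one CSV file; return its rows as dicts, or [] if the header does not match."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows or not has_required_header(rows[0]):
        return []
    header = [column.strip() for column in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


def collect_records(csv_dir):
    """Gather records from every matching CSV file in csv_dir (a hypothetical folder)."""
    records = []
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        records.extend(csv_to_records(csv_path))
    return records


def records_to_json(records):
    """Serialize the collected records as a JSON array (Step 8)."""
    return json.dumps(records, indent=2)
```

The list returned by collect_records is a plain list of dicts, so for Step 9 it could presumably be handed to pymongo's insert_many on the target collection, provided the list is not empty.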


What I have tried:

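import csv
import io
import json
import os

import boto3
from pdf2image import convert_from_path

# Split the PDF into one PIL image per page at 500 DPI. If Poppler is not on PATH, pass
# poppler_path=r'C:\Program Files\poppler-22.01.0\Library\bin' as well.
images = convert_from_path('table.pdf', 500)

# Keep a copy of every page on disk
for i, image in enumerate(images):
    image.save('page' + str(i) + '.jpg')

aws_access_key_id = ""
aws_secret_access_key = ""

# A region must also be configured, either with region_name=... here or in the AWS config
textract = boto3.client('textract',
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)

def cell_text(cell, blocks_by_id):
    # A CELL block lists its WORD blocks through a CHILD relationship
    words = []
    for rel in cell.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = blocks_by_id[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)

def create_table(image):
    # Textract needs an encoded image (JPEG or PNG), not the raw pixels from tobytes()
    buffer = io.BytesIO()
    image.save(buffer, format='JPEG')
    response = textract.analyze_document(
        Document={
            'Bytes': buffer.getvalue()
        },
        FeatureTypes=["TABLES"]
    )

    # Relationships only hold block Ids, so index every block by Id first
    blocks_by_id = {block['Id']: block for block in response['Blocks']}

    tables = []
    for item in response['Blocks']:
        if item['BlockType'] == 'TABLE':
            cells = {}
            for rel in item.get('Relationships', []):
                if rel['Type'] != 'CHILD':
                    continue
                for cell_id in rel['Ids']:
                    cell = blocks_by_id[cell_id]
                    if cell['BlockType'] == 'CELL':
                        # RowIndex and ColumnIndex are 1-based
                        cells[(cell['RowIndex'], cell['ColumnIndex'])] = cell_text(cell, blocks_by_id)
            if not cells:
                continue
            n_rows = max(row for row, _ in cells)
            n_cols = max(col for _, col in cells)
            table = [[cells.get((row, col), '') for col in range(1, n_cols + 1)]
                     for row in range(1, n_rows + 1)]
            tables.append(table)

    return tables

def extract_tables_from_pdf(images, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for i, image in enumerate(images):
        for j, table in enumerate(create_table(image)):
            csv_file = os.path.join(output_dir, f'table_{i}_{j}.csv')
            with open(csv_file, 'w', newline='') as f:
                csv.writer(f).writerows(table)

            # The csv module has no to_json; write the same rows with the json module
            json_file = os.path.join(output_dir, f'table_{i}_{j}.json')
            with open(json_file, 'w') as f:
                json.dump(table, f)

# Write the CSV and JSON files next to the script
extract_tables_from_pdf(images, '.')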
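
The code in the question only requests TABLES. For the key-value pairs mentioned in Step 5, the same call would presumably be made with FeatureTypes=["FORMS"], and the response then contains KEY_VALUE_SET blocks: a KEY block points to its VALUE block through a VALUE relationship, and both point to their WORD blocks through CHILD relationships. The following is a minimal sketch of walking that structure. The helper names (block_text, extract_key_values) are hypothetical, and sample_response is a hand-written stand-in in the shape of a Textract response, not real output, so the parsing can be checked without calling AWS.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that a block lists as CHILD."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Return {key text: value text} from an analyze_document FORMS response."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        key = block_text(block, blocks_by_id)
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[key] = " ".join(value for value in values if value)
    return pairs


# Hand-written sample in the shape of a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Patient"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    ]
}

print(extract_key_values(sample_response))
```

Only WORD children are collected here, so other child types such as checkboxes are ignored, and a key that appears twice on a page keeps only its last value.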
Comments
OriginalGriff 9-Feb-23 6:06am    
And?
What does it do that you didn't expect, or not do that you did?
What have you tried to do to find out why?
Are there any error messages, and if so, where and when? What did you do to make them happen?

This is not a good question - we cannot work out from that little what you are trying to do.
Remember that we can't see your screen, access your HDD, or read your mind - we only get exactly what you type to work with.
Use the "Improve question" widget to edit your question and provide better information.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


