How to do the OCR project with Python

Question

1.00/5 (3 votes)

See more:

Step 1: firstly create an AWS account. And then create an IAM user with AWS
textract full access.
Step 2: After creating IAM used we get aws_access_key_id, and
aws_secret_access_key these keys are important to connect AWS with our local
python script.
Step 3: When the code triggers firstly it split a pdf file into images for that pdf2image
library used.
Step 4: After that one by one image pass to the createTable(image) function using
this function it fetches a table from the image and saves it into a CSV file.
Step 5: Step 4 gets repeated for the key-value pair.
Step 6: Once all files are created Code() function trigger
Step 7: In the code function, it takes only those CSV files whose header contains Test
name, Technology, value, and units this header is for thyrocare. We can change the
header according to reports.
Step 8: Then add that test data into JSON format.
Step 9: Once JSON is created it dumps into MongoDB. For connection of Python
and MongoDB pymongo library use.
Step 10: After copy the JSON file and pdf file into the pdf collection folder. And
others CSV, the image file gets deleted

What I have tried:

from pdf2image import convert_from_path
import boto3
import csv
import os

images = convert_from_path('table.pdf',500) #poppler_path=r'C:\Program Files\poppler-22.01.0\Library\bin')

for i in range(len(images)):
    images[i].save('page'+str(i)+'.jpg')

aws_access_key_id = ""
aws_secret_access_key = ""

textract = boto3.client('textract',
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)

def create_table(images):
    image_data = images.tobytes()
    response = textract.analyze_document(
        Document={
            'Bytes': image_data
        },
        FeatureTypes=["TABLES"]
    )

    tables = []
    for item in response['Blocks']:
        if item['BlockType'] == 'TABLE':
            table = []
            cells = []
            for cell in item['Relationships']:
                row_index = cell['Type']
                col_index = cell['Type']
                text = cell['Text']
                cells.append((row_index, col_index, text))
            table.append(cells)

    return tables

with open('page0.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(images)

def extract_tables_from_pdf(output_dir):

    for i, images in enumerate(images):
        table = create_table(images)
        csv_file = os.path.join(output_dir, f'table_{i}.csv')
        
        with open('page0.json','w') as f:
            csv.to_json.dumps(csv_file)

Posted 8-Feb-23 23:34pm

Aniket Feb2023

Add a Solution

Comments

OriginalGriff 9-Feb-23 6:06am

And?
What does it do that you didn't expect, or not do that you did?
What have you tried to do to find out why?
Are there any error messages, and if so, where and when? What did you do to make them happen?

This is not a good question - we cannot work out from that little what you are trying to do.
Remember that we can't see your screen, access your HDD, or read your mind - we only get exactly what you type to work with.
Use the "Improve question" widget to edit your question and provide better information.

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)