Step 1: Create an AWS account, then create an IAM user with Amazon Textract full access.
Step 2: Creating the IAM user gives us an aws_access_key_id and an aws_secret_access_key. These keys are needed to connect our local Python script to AWS.
Step 3: When the code runs, it first splits the PDF file into images, one per page. The pdf2image library is used for this.
Step 4: Each image is then passed, one at a time, to the create_table(image) function, which fetches the tables from the image and saves them to CSV files.
Step 5: Step 4 is repeated to extract the key-value pairs.
Step 6: Once all the files have been created, the Code() function is triggered.
Step 7: The Code() function picks up only those CSV files whose header contains Test name, Technology, value, and units. This header is specific to Thyrocare reports; it can be changed to suit other reports.
Step 8: The test data is then converted into JSON format.
Step 9: Once the JSON is created, it is dumped into MongoDB. The pymongo library is used to connect Python to MongoDB.
Step 10: Finally, the JSON file and the PDF file are copied into the pdf collection folder, and the remaining CSV and image files are deleted.
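from pdf2image import convert_from_path
import boto3
import csv
import io
import json
import os

# Fill in the keys of the IAM user created in Step 2.
aws_access_key_id = ""
aws_secret_access_key = ""
# The region is read from the local AWS configuration (for example AWS_DEFAULT_REGION).
textract = boto3.client('textract',
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)


def create_table(image):
    """Return every table found on one page image as a list of rows of cell text."""
    # Textract expects encoded JPEG/PNG bytes, not the raw pixels from tobytes().
    buffer = io.BytesIO()
    image.save(buffer, format='JPEG')
    response = textract.analyze_document(
        Document={'Bytes': buffer.getvalue()},
        FeatureTypes=["TABLES"]
    )
    # Index the blocks by Id so that relationships can be followed.
    blocks = {block['Id']: block for block in response['Blocks']}

    def child_ids(block):
        return [child_id
                for rel in block.get('Relationships', [])
                if rel['Type'] == 'CHILD'
                for child_id in rel['Ids']]

    tables = []
    for item in response['Blocks']:
        if item['BlockType'] != 'TABLE':
            continue
        # Map (row, column) to the text of each cell; the indexes start at 1.
        cells = {}
        for cell_id in child_ids(item):
            cell = blocks[cell_id]
            if cell['BlockType'] != 'CELL':
                continue
            words = [blocks[w]['Text'] for w in child_ids(cell)
                     if blocks[w]['BlockType'] == 'WORD']
            cells[(cell['RowIndex'], cell['ColumnIndex'])] = ' '.join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), '') for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def extract_tables_from_pdf(pdf_path, output_dir):
    # On Windows, poppler_path=r'C:\Program Files\poppler-22.01.0\Library\bin' may be needed.
    images = convert_from_path(pdf_path, 500)
    for i, image in enumerate(images):
        image.save(os.path.join(output_dir, 'page' + str(i) + '.jpg'))
        tables = create_table(image)
        # One CSV file per table found on the page.
        for j, table in enumerate(tables):
            csv_file = os.path.join(output_dir, f'table_{i}_{j}.csv')
            with open(csv_file, 'w', newline='') as f:
                csv.writer(f).writerows(table)
        # Also keep the tables of the page as JSON.
        with open(os.path.join(output_dir, f'page{i}.json'), 'w') as f:
            json.dump(tables, f)


extract_tables_from_pdf('table.pdf', '.')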
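The listing above covers Steps 3 and 4 only. Steps 7 and 8 could be sketched roughly as follows. This is a minimal illustration and not the original Code() function: the names collect_tests and write_json, the lower-cased column names, and the {"tests": [...]} layout of the JSON are all assumptions made for the example. It keeps a CSV file only if its header contains the four Thyrocare columns and turns each data row into one JSON record.

```python
import csv
import json
import os

# Columns named in Step 7 (Thyrocare reports); change them for other report layouts.
# Lower-case spelling is an assumption so that "Test Name" and "test name" both match.
EXPECTED_COLUMNS = ["test name", "technology", "value", "units"]


def normalize(header_row):
    """Strip and lower-case header cells so small differences still match."""
    return [cell.strip().lower() for cell in header_row]


def collect_tests(csv_dir):
    """Return one dict per data row from every CSV whose header has all expected columns."""
    tests = []
    for name in sorted(os.listdir(csv_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(csv_dir, name), newline="") as f:
            rows = list(csv.reader(f))
        if not rows:
            continue
        header = normalize(rows[0])
        if not all(col in header for col in EXPECTED_COLUMNS):
            continue  # not a test-result table, skip the file
        positions = [header.index(col) for col in EXPECTED_COLUMNS]
        for row in rows[1:]:
            if len(row) < len(header):
                continue  # skip rows that are shorter than the header
            tests.append({col: row[pos].strip()
                          for col, pos in zip(EXPECTED_COLUMNS, positions)})
    return tests


def write_json(tests, json_path):
    """Write the collected rows to a JSON file (hypothetical layout)."""
    with open(json_path, "w") as f:
        json.dump({"tests": tests}, f, indent=2)
```

Calling write_json(collect_tests('.'), 'report.json') would produce a single JSON file from the CSV files in the current folder. That file could then be loaded and inserted into MongoDB with pymongo, as described in Step 9.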
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)