I am planning to retrieve around 47M records from an Oracle DB to a flat file using Python (I am not able to use other languages here), but it is taking nearly 2 hours to complete.

Can anyone please help me get this done in a much faster way?

I have read a few articles on the subject, but they did not help.

Please find below the code I have tried.

What I have tried:

import pandas as pd
import cx_Oracle
import datetime
import os

try:
    chunk_size = 1000000
    batch_no = 1
    start_time = datetime.datetime.now()
    print("Program Start Time:: ", start_time)
    dsn_tns = cx_Oracle.makedsn('HOSTNAME', 'PORT', service_name=r'SERVICE_NAME')
    conn = cx_Oracle.connect(user=r'USERNAME', password=r'PASSWORD', dsn=dsn_tns)
    query = """Select /*+ parallel(a,16) */ * from SCHEMA.TABLE_NAME a"""
    # Stream the result set in 1,000,000-row chunks; write each chunk to its own pipe-delimited file
    for chunk in pd.read_sql_query(query, con=conn, chunksize=chunk_size):
        chunk.to_csv(r'<path>\Test_Sample_Output_20230303_' + str(batch_no) + '.txt', sep='|',
                     quotechar='"', escapechar='\\', index=False, encoding='utf-8-sig')
        batch_no += 1
    end_time = datetime.datetime.now()
    print("Program End Time:: ", end_time)
    Exact_time_Taken = end_time - start_time
    print("Exact Time Taken :: ", Exact_time_Taken)

except cx_Oracle.DatabaseError as e:
    print("There is a problem with Oracle", e)
Posted
Comments
Richard MacCutchan 7-Mar-23 4:03am    
It's due to the laws of physics. 47,000,000 multiplied by: the time to retrieve the data, format it into an SQL response packet, convert it to a pandas DataFrame item, and convert that to a CSV string, which then gets written to a flat file.
Dave Kreskowiak 7-Mar-23 7:43am    
On top of what Richard has said, another factor limiting your speed is the size of each record, times 47 million. How much data is that? What's the speed of your network between the code and the Oracle server?

Say each record is 2K bytes. That results in about 96 GIGABYTES of data being transferred!!

...and you want that transferred in less than 2 hours? You better be on a 10Gb network or faster.
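To put rough numbers on that, here is a quick back-of-envelope check in Python. The 2 KB per record figure is the assumption from the comment above, and the link speeds are illustrative only; protocol overhead is ignored.

# Rough volume/transfer estimate (illustrative figures, not measurements)
records = 47_000_000
bytes_per_record = 2 * 1024                                  # assumption: ~2 KB per row
total_bytes = records * bytes_per_record
print(f"Total volume: {total_bytes / 1e9:.0f} GB")           # ~96 GB

for label, bits_per_second in [("1 Gb/s", 1e9), ("10 Gb/s", 10e9)]:
    seconds = total_bytes * 8 / bits_per_second              # raw transfer time only
    print(f"{label}: ~{seconds / 60:.1f} minutes just to move the bytes")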
[no name] 7-Mar-23 12:54pm    
Defragging the output drive and / or using bigger chunks might squeeze out a few more seconds or minutes.
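One way to make the "bigger chunks" idea concrete is to skip pandas entirely and fetch with a tuned cx_Oracle cursor, streaming rows straight to disk with csv.writer. This is only a sketch, not the poster's code: the arraysize/prefetchrows values are guesses to tune, prefetchrows needs cx_Oracle 8+, and the credentials, path and output filename just mirror the placeholders in the script above.

import csv
import cx_Oracle

dsn_tns = cx_Oracle.makedsn('HOSTNAME', 'PORT', service_name=r'SERVICE_NAME')
conn = cx_Oracle.connect(user=r'USERNAME', password=r'PASSWORD', dsn=dsn_tns)
cursor = conn.cursor()
cursor.arraysize = 100000        # rows fetched per round trip; tune to available memory
cursor.prefetchrows = 100000     # cx_Oracle 8+; cuts round trips further
cursor.execute("Select /*+ parallel(a,16) */ * from SCHEMA.TABLE_NAME a")

with open(r'<path>\Test_Sample_Output.txt', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f, delimiter='|', quotechar='"')
    writer.writerow(col[0] for col in cursor.description)    # header row from column names
    while True:
        rows = cursor.fetchmany()                            # batch size = cursor.arraysize
        if not rows:
            break
        writer.writerows(rows)

cursor.close()
conn.close()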
Member 15627495 10-Mar-23 3:19am    
As written, your query loads *all* the records; the 'chunk_size' parameter only splits the SQL result every 1,000,000 rows. Try dividing that job further.

Did you try a dynamic query with a 'LIMIT'-style clause, so the SQL engine does the work instead of the Python runtime? You are really carrying a lot of work (and data) through read_sql_query(query, con=conn, chunksize=chunk_size), and through your SQL query too.
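If you try that route, note that Oracle has no LIMIT keyword; 12c and later express the same paging idea with OFFSET ... FETCH NEXT. The sketch below assumes pd and conn are the pandas import and connection from the script above, and the ORDER BY column is a placeholder; whether paging beats a single streamed query is something to measure.

# Paging sketch: Oracle 12c+ OFFSET/FETCH in place of LIMIT
page_size = 1000000
offset = 0
batch_no = 1
paged_query = """SELECT * FROM SCHEMA.TABLE_NAME a
                 ORDER BY a.<unique_key_column>
                 OFFSET :off ROWS FETCH NEXT :lim ROWS ONLY"""
while True:
    page = pd.read_sql_query(paged_query, con=conn, params={"off": offset, "lim": page_size})
    if page.empty:
        break
    page.to_csv(r'<path>\Test_Sample_Output_page_' + str(batch_no) + '.txt',
                sep='|', index=False, encoding='utf-8-sig')
    offset += page_size
    batch_no += 1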

Did you identify the heavy consumer of time (and therefore of computing)? Add a timer around each instruction to get a concrete answer about where the time goes.
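A minimal way to do that with the script above is to time the fetch and the write separately with time.perf_counter(); query, conn, chunk_size and the output path here are assumed to be the ones from the original code.

import time

fetch_seconds = 0.0
write_seconds = 0.0
batch_no = 1
reader = pd.read_sql_query(query, con=conn, chunksize=chunk_size)
while True:
    t0 = time.perf_counter()
    try:
        chunk = next(reader)                       # fetching + DataFrame conversion
    except StopIteration:
        break
    fetch_seconds += time.perf_counter() - t0

    t0 = time.perf_counter()
    chunk.to_csv(r'<path>\Test_Sample_Output_20230303_' + str(batch_no) + '.txt', sep='|',
                 quotechar='"', escapechar='\\', index=False, encoding='utf-8-sig')
    write_seconds += time.perf_counter() - t0      # CSV formatting + disk write
    batch_no += 1

print("Fetch/convert time:", fetch_seconds, "s")
print("Write time:        ", write_seconds, "s")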
Member 15627495 10-Mar-23 3:39am    
To make the file writing behave better and avoid keeping the operating system busy, you could add a pause after each CSV write. The write operations then run more smoothly and put less pressure on the OS (for a possible time gain).
In Python this is sleep(n seconds); explanation at

https://superfastpython.com/thread-sleep-in-python/
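For completeness, the call that article describes is time.sleep(); a minimal sketch of the idea applied to the loop above, with an arbitrary pause length (whether it actually gains anything is worth measuring):

import time

for chunk in pd.read_sql_query(query, con=conn, chunksize=chunk_size):
    chunk.to_csv(r'<path>\Test_Sample_Output_20230303_' + str(batch_no) + '.txt', sep='|',
                 quotechar='"', escapechar='\\', index=False, encoding='utf-8-sig')
    batch_no += 1
    time.sleep(2)    # arbitrary pause between writes, as the comment suggests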
