Hello everyone.
I have a website from which I need to download about 1,000 PDFs. Their URLs differ only by a four-digit number between 1101 and 2300, for example:
https://intvd.gib.gov.tr/2014_Emlak_Arsa/EmlakServlet?tip=9&ilKodu=3&ilceKodu=1101
Some of the numbers in that range are not assigned to a PDF, though, so I need something that will
1-) download all the PDFs
2-) delete the PDFs that are 1 KB (these are the unassigned ones)
3-) merge all the remaining PDF files into one PDF file
I found answers for each of these steps separately but not for all of them together, so I combined them into the script below. In the end, though, all I get is a 1 KB PDF file called merged_full.pdf
What am I doing wrong?
Cheers

What I have tried:

Python
import urllib.request
import os
from PyPDF2 import PdfFileReader, PdfFileMerger

os.chdir('the_directory')

mylist=(list(range(1101,2500)))

for i in mylist:
    def download_file(download_url):
        web_file = urllib.request.urlopen('https://intvd.gib.gov.tr/2014_Emlak_Arsa/EmlakServlet?tip=9&ilKodu=3&ilceKodu=%d'%(i),'%d.pdf'%(i))
        local_file = open('%d.pdf'%(i), 'wb')
        local_file.write(web_file.read())
        web_file.close()
        local_file.close()
        filesize = os.path.getsize('%d.pdf'%(i))
        if filesize<1024:
                os.remove('%d.pdf'%(i))
        del filesize

files_dir = "the_directory"
pdf_files = [f for f in os.listdir(files_dir) if f.endswith("pdf")]
merger = PdfFileMerger()

for filename in pdf_files:
    merger.append(PdfFileReader(os.path.join(files_dir, filename), "rb"))

merger.write(os.path.join(files_dir, "merged_full.pdf"))
Updated 10-Aug-17 3:35am

1 solution

A size of 1 KB indicates that merged_full.pdf is just an empty PDF, which means nothing was appended to the merger.

You should insert checks in your code to see if each step is working as expected:

Are the files downloaded?
What are the sizes of the downloaded files?
Are the files listed in pdf_files?
Does PdfFileReader() read the files?
Is the content appended to the merger?
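The first three checks above can be sketched with a small helper that reports which PDF files exist and how large they are. The function name list_pdf_sizes is hypothetical and not part of the original code; it uses only the standard library.

```python
import os


def list_pdf_sizes(files_dir):
    """Return (filename, size_in_bytes) for every .pdf in files_dir, sorted by name."""
    sizes = []
    for name in sorted(os.listdir(files_dir)):
        if name.lower().endswith(".pdf"):
            full_path = os.path.join(files_dir, name)
            sizes.append((name, os.path.getsize(full_path)))
    return sizes
```

Calling list_pdf_sizes("the_directory") after the download loop and printing the result should show whether any files were written at all. With the code in the question the list would most likely be empty, because download_file is defined inside the loop but never called.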

According to the PdfFileMerger documentation, there is no need to use a reader. Just pass the path or an open file object to the append function (here the stream object is created inline). Note that Python 3 has no file() built-in, so use open():
Python
merger.append(open(os.path.join(files_dir, filename), 'rb'))
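Putting the pieces together, the following is a minimal sketch of the whole task, not a tested solution for this particular site. It addresses two problems visible in the question's code: download_file is defined but never called, and the second positional argument of urllib.request.urlopen is the POST data, not a local filename. The names BASE_URL, MIN_SIZE, download_pdf, numeric_order and merge_pdfs are hypothetical, the range follows the 1101-2300 stated in the question text, and PdfFileMerger is assumed to come from a PyPDF2 release older than 3.0, as in the question.

```python
import os
import urllib.request

# URL pattern taken from the question; %d is replaced by the four-digit code.
BASE_URL = ("https://intvd.gib.gov.tr/2014_Emlak_Arsa/EmlakServlet"
            "?tip=9&ilKodu=3&ilceKodu=%d")
MIN_SIZE = 1024  # bytes; the question treats smaller files as unassigned codes


def download_pdf(code, files_dir):
    """Download one PDF; return its path, or None if it is smaller than MIN_SIZE."""
    with urllib.request.urlopen(BASE_URL % code) as response:
        data = response.read()
    if len(data) < MIN_SIZE:
        return None  # skip unassigned codes instead of writing and deleting
    path = os.path.join(files_dir, "%d.pdf" % code)
    with open(path, "wb") as local_file:
        local_file.write(data)
    return path


def numeric_order(filenames):
    """Keep names like '1101.pdf' and sort them by number.

    This also leaves out merged_full.pdf, so a second run does not
    merge the previous output into itself.
    """
    numbered = [f for f in filenames if f.endswith(".pdf") and f[:-4].isdigit()]
    return sorted(numbered, key=lambda f: int(f[:-4]))


def merge_pdfs(files_dir, output_path):
    """Append every numbered PDF in files_dir to one output file."""
    from PyPDF2 import PdfFileMerger  # assumption: PyPDF2 < 3.0

    merger = PdfFileMerger()
    for filename in numeric_order(os.listdir(files_dir)):
        merger.append(os.path.join(files_dir, filename))  # a path is accepted
    with open(output_path, "wb") as out:
        merger.write(out)
    merger.close()


def main():
    files_dir = "the_directory"  # placeholder path from the question
    for code in range(1101, 2301):
        download_pdf(code, files_dir)
    merge_pdfs(files_dir, os.path.join(files_dir, "merged_full.pdf"))
```

Calling main() runs all three steps. Checking the size before writing replaces the download-then-delete step, and sorting by number keeps the pages in code order rather than in whatever order os.listdir returns.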
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


