Python code for downloading and merging PDF files in a loop results in a 1 KB output file

Question

Hello everyone.
I have a website that I need to get ~1000 PDFs from. The PDFs differ only by a four-digit number between 1101 and 2300, like
https://intvd.gib.gov.tr/2014_Emlak_Arsa/EmlakServlet?tip=9&ilKodu=3&ilceKodu=1101
Some of the numbers in that range are not assigned to a PDF though, so I needed something that would
1-) download all the PDFs
2-) delete the PDFs that are 1 KB (these are the non-assigned ones)
3-) merge all the PDF files into one PDF file
There were answers for each of these steps but not for all of them together, so I looked at those and made something. In the end though, all I get is a 1 KB PDF file called merged_full.pdf
What am I doing wrong?
Cheers

What I have tried:

Python

import urllib.request
import os
from PyPDF2 import PdfFileReader, PdfFileMerger

os.chdir('the_directory')

mylist=(list(range(1101,2500)))

for i in mylist:
    def download_file(download_url):
        web_file = urllib.request.urlopen('https://intvd.gib.gov.tr/2014_Emlak_Arsa/EmlakServlet?tip=9&ilKodu=3&ilceKodu=%d'%(i),'%d.pdf'%(i))
        local_file = open('%d.pdf'%(i), 'wb')
        local_file.write(web_file.read())
        web_file.close()
        local_file.close()
        filesize = os.path.getsize('%d.pdf'%(i))
        if filesize<1024:
                os.remove('%d.pdf'%(i))
        del filesize

files_dir = "the_directory"
pdf_files = [f for f in os.listdir(files_dir) if f.endswith("pdf")]
merger = PdfFileMerger()

for filename in pdf_files:
    merger.append(PdfFileReader(os.path.join(files_dir, filename), "rb"))

merger.write(os.path.join(files_dir, "merged_full.pdf"))

Posted 10-Aug-17 2:30am

Member 13355404

Updated 10-Aug-17 3:35am

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Jochen Arndt · Answer 1 · 2017-08-10T03:35:00

A size of 1 KB indicates that the created file is just an empty PDF.

You should insert checks in your code to see if each step is working as expected:

Are the files downloaded?
What sizes do the existing files have?
Are the files listed in your pdf_files?
Does PdfFileReader() read the files?
Is the content appended to the merger?
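The first few checks above could be scripted roughly as follows. This is only a minimal sketch: the helper name list_pdf_sizes is made up for illustration, and it uses nothing beyond the standard library.

```python
import os


def list_pdf_sizes(files_dir):
    """Return (filename, size_in_bytes) for every .pdf file in files_dir."""
    result = []
    for name in sorted(os.listdir(files_dir)):
        if name.lower().endswith(".pdf"):
            size = os.path.getsize(os.path.join(files_dir, name))
            result.append((name, size))
    return result
```

Printing the result of list_pdf_sizes('the_directory') right after the download loop shows whether any files were written at all and how large they are. An empty list would mean the problem is in the download step, not in the merge.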

According to the PdfFileMerger documentation, there should be no need to use a reader. Just pass the path, or a file object opened in place as shown here, to the append function:

Python

merger.append(open(os.path.join(files_dir, filename), 'rb'))
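
One thing the checks would probably reveal: in the posted code, download_file is defined inside the loop but never called, so nothing gets downloaded. In addition, urlopen treats its second argument as POST data, not as a file name. Below is a hedged sketch of how the whole job might look. The function names and the opener parameter are made up for illustration, the 1024-byte threshold is the one from the question, and it assumes a PyPDF2 release that still provides PdfFileMerger.

```python
import os
import urllib.request

# URL pattern taken from the question; %d is the four-digit code.
BASE_URL = ("https://intvd.gib.gov.tr/2014_Emlak_Arsa/EmlakServlet"
            "?tip=9&ilKodu=3&ilceKodu=%d")


def download_pdf(number, target_dir, min_size=1024,
                 opener=urllib.request.urlopen):
    """Download one PDF; return its path, or None if the response is too small.

    `opener` is a hypothetical hook so the function can be tried
    without network access.
    """
    with opener(BASE_URL % number) as response:
        data = response.read()
    if len(data) < min_size:  # same threshold as in the question
        return None
    path = os.path.join(target_dir, "%d.pdf" % number)
    with open(path, "wb") as local_file:
        local_file.write(data)
    return path


def merge_pdfs(paths, output_path):
    """Merge the given PDF files, in order, into output_path."""
    # Imported here so the download part works without PyPDF2 installed.
    from PyPDF2 import PdfFileMerger
    merger = PdfFileMerger()
    for path in paths:
        merger.append(path)  # append() accepts a path, no reader needed
    merger.write(output_path)
    merger.close()


def download_and_merge(first, last, target_dir, output_name="merged_full.pdf"):
    """Download every number in [first, last], then merge what was kept."""
    paths = []
    for number in range(first, last + 1):
        path = download_pdf(number, target_dir)  # the call the posted loop lacked
        if path is not None:
            paths.append(path)
    if paths:
        merge_pdfs(paths, os.path.join(target_dir, output_name))
    return paths
```

Collecting the paths while downloading, instead of listing the directory afterwards, keeps the pages in numeric order and avoids merging an old merged_full.pdf into the new one on a second run.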