I am new to Python. I want to extract the names of the categories and pages (the category tree) under a Wikipedia category by crawling it. When I run my code I get the error below, and I have not been able to resolve it. Any help is greatly appreciated.

Downloading
Traceback (most recent call last):
  File "C:\Users\SIBA\Desktop\PDF\Code\trialcode.py", line 100, in <module>
    printTree(name, 0)
  File "C:\Users\SIBA\Desktop\PDF\Code\trialcode.py", line 80, in printTree
    content = open("categories/Category:"+catName+".html").readlines()
FileNotFoundError: [Errno 2] No such file or directory: 'categories/Category:Cricket.html'

The code I have tried is shown below. I am using Python 3.6.

What I have tried:

Python
#Imports
import httplib2
from bs4 import BeautifulSoup
import subprocess
import time
import os,sys
os.path.dirname(sys.argv[0])

#declarations
catRoot = "http://en.wikipedia.org/wiki/Category:"
MAX_DEPTH = 100
done = []
ignore = []
# Removes all newline characters and replaces with spaces
def removeNewLines(in_text):
return in_text.replace('\n', ' ')

# Downloads a link into the destination
def download(link, dest):
# print link
if not os.path.exists(dest) or os.path.getsize(dest) == 0:
subprocess.getoutput('wget "' + link + '" -O "' + dest+ '"')
print ("Downloading")

def ensureDir(f):
    if not os.path.exists(f):
    os.makedirs(f)

# Cleans a text by removing tags
def clean(in_text):
s_list = list(in_text)
i,j = 0,0
while i < len(s_list):
    # iterate until a left-angle bracket is found
    if s_list[i] == '<':
        if s_list[i+1] == 'b' and s_list[i+2] == 'r' and s_list[i+3] == '>':
            i=i+1
            print (hello)
            continue
        while s_list[i] != '>':
            # pop everything from the the left-angle bracket until the right-angle bracket
            s_list.pop(i)
        # pops the right-angle bracket, too
        s_list.pop(i)

    elif s_list[i] == '\n':
        s_list.pop(i)
    else:
        i=i+1

# convert the list back into text
join_char=''
return (join_char.join(s_list))#.replace("<br>","\n")

# Gets bullets
def getBullets(content):
    mainSoup = BeautifulSoup(contents)

# Gets empty bullets
def getAllBullets(content):
mainSoup = BeautifulSoup(str(content))
subcategories = mainSoup.findAll('div',attrs={"class" :  "CategoryTreeItem"})
empty = []
full = []
for x in subcategories:
    subSoup = BeautifulSoup(str(x))
    link = str(subSoup.findAll('a')[0])
    if (str(x)).count("CategoryTreeEmptyBullet") > 0:
        empty.append(clean(link).replace(" ","_"))
    elif (str(x)).count("CategoryTreeBullet") > 0:
        full.append(clean(link).replace(" ","_"))

return((empty,full))

def printTree(catName, count):
catName = catName.replace("\\'","'")
if count == MAX_DEPTH: return
   path='trivial'
   download(catRoot+catName, path)
content = ("Category:"+catName+".html")
filepath=open("content")
(emptyBullets,fullBullets) = getAllBullets(content)
f.close()

for x in emptyBullets:
    for i in range(count): print ("  "),
    download(catRoot+x, "categories/Category:"+x+".html")
    print (x)

for x in fullBullets:
    for i in range(count): print ("  "),
    print (x)
    if x in done:
        print ("Done... "+x)
        continue
    done.append(x)
    try: printTree(x, count + 1)
    except: print ("ERROR: " + x)

name = "Cricket"
printTree(name, 0)
Updated 11-Feb-20 19:11pm

1 solution

The error message is quite clear: The mentioned file does not exist.

However, the posted code has indentation errors and does not match the line quoted in the error message, so it is nearly impossible to help from the posted code alone.

In any case, you can check whether the file exists before trying to open it, and act accordingly.
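As a minimal sketch of that check (the helper name `read_category_lines` is made up for illustration and is not part of the posted code), the file can be tested for existence before it is opened, with the absolute path printed when it is missing:

```python
from pathlib import Path


def read_category_lines(path):
    """Return the lines of a downloaded page, or None if the file is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        # Print the absolute path so it is clear where Python actually looked.
        print("Missing file:", file_path.resolve())
        return None
    with file_path.open(encoding="utf-8") as handle:
        return handle.readlines()
```

The caller can then skip the category or retry the download when `None` comes back, instead of failing with FileNotFoundError.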

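Note also that using relative paths is error-prone: they are resolved against the current working directory, which is not necessarily the directory that contains the script.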
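One way to avoid depending on the working directory is to build an absolute path from the script's own location. The sketch below is only an illustration: `script_dir` and `category_path` are hypothetical helpers, and the `Category_` file-name prefix is an assumption, chosen because the traceback shows a Windows machine and ':' is not allowed in Windows file names.

```python
import os
import sys


def script_dir():
    """Directory of the running script, or the current directory if unknown."""
    script = sys.argv[0]
    return os.path.dirname(os.path.abspath(script)) if script else os.getcwd()


def category_path(cat_name, base_dir=None):
    """Build an absolute path for a category page and create its folder."""
    if base_dir is None:
        base_dir = script_dir()
    folder = os.path.join(base_dir, "categories")
    # Create the target folder up front so the download has somewhere to write.
    os.makedirs(folder, exist_ok=True)
    # Assumption: "Category_" replaces "Category:" to keep the name valid on Windows.
    return os.path.join(folder, "Category_" + cat_name + ".html")
```

Using the same function for both the download destination and the later `open` call keeps the two paths from drifting apart.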

Finally, you should check whether the wget call succeeded. If it did not, the file is not created.
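A hedged sketch of such a check, assuming wget is installed and on the PATH (the `tool` parameter and the boolean return value are additions for illustration, not part of the posted code):

```python
import os
import subprocess


def download(link, dest, tool="wget"):
    """Fetch link into dest; return True only if the tool succeeded."""
    try:
        # Passing a list avoids shell quoting problems with the URL and path.
        result = subprocess.run([tool, link, "-O", dest],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    except FileNotFoundError:
        # Raised when the download tool itself cannot be found.
        print("Download tool not found:", tool)
        return False
    if result.returncode != 0:
        print("Download failed (exit code %d): %s" % (result.returncode, link))
        return False
    # Treat a missing or empty output file as a failure too.
    return os.path.exists(dest) and os.path.getsize(dest) > 0
```

The caller can then stop or skip the category when `download` returns False, rather than going on to open a file that was never written.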
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


