Python
from bs4 import BeautifulSoup
import html5lib
import requests

tw_link = open("TW_Links.txt","r")
im_link = open("IMG_Links.txt","w+")

def get_images(urli):
  rs = requests.Session()
  urls=rs.get(urli)
  soup = BeautifulSoup(urls.text , "html5lib")
  #print(soup.prettify())
  content = soup.find("div", {"class": "tt_article_useless_p_margin"})
  images = content.findAll('img')
  for img in images:
    img_url = img['src']+"?original"
    print(img_url,file=im_link)

def get_links():
  count=1
  for line in tw_link:
    print(line,count)
    count+=1
    get_images(line)
get_links()


What I have tried:

The code seems to work fine when using a single link, but when I pass the URLs to the function I get the following error:

AttributeError                            Traceback (most recent call last)
in ()
     23   count+=1
     24   get_images(line)
---> 25 get_links()

1 frames
in get_links()
     22     print(line,count)
     23     count+=1
---> 24     get_images(line)
     25 get_links()

in get_images(urli)
     12   print(soup.prettify())
     13   content = soup.find("div", {"class": "tt_article_useless_p_margin"})
---> 14   images = content.findAll('img')
     15   for img in images:
     16     img_url = img['src']+"?original"

AttributeError: 'NoneType' object has no attribute 'findAll'


My guess is that I'm triggering some sort of bot detection (because when passing a single link a different page is loaded, not the one being loaded currently). Is there any way to bypass that? I've tried using time.sleep(5), but that also didn't work.
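The traceback means soup.find returned None: the page fetched in the loop does not contain the expected div at all (a 404 or redirect page, not a bot block). A minimal way to see this, using a stand-in HTML snippet rather than the real site, is to guard against the None before calling findAll:

```python
from bs4 import BeautifulSoup

# Stand-in pages: one shaped like the site's article, one like an error page.
html_present = ('<div class="tt_article_useless_p_margin">'
                '<img src="/a.png"></div>')
html_missing = "<html><body><p>404 Not Found</p></body></html>"

def extract_image_urls(page_text):
    soup = BeautifulSoup(page_text, "html.parser")
    content = soup.find("div", {"class": "tt_article_useless_p_margin"})
    if content is None:
        # find() returns None when the tag is absent; calling
        # .findAll on it raises exactly the AttributeError above.
        return None
    return [img["src"] + "?original" for img in content.find_all("img")]

print(extract_image_urls(html_present))
print(extract_image_urls(html_missing))
```

With a check like this the loop can log and skip the failing URL instead of crashing, which makes the underlying cause (the wrong page being served) visible.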
Updated 19-Dec-20 4:15am (v2)
Comments
Richard MacCutchan 18-Dec-20 9:27am    
The error message is telling you that the variable named content does not contain a valid reference to an object. Which in turn probably means that soup.find in the line above, did not find the relevant HTML tag. You will need to do some debugging to find out why it fails.
SHIVAM SAH 19-Dec-20 2:15am    
So far I've tried using encoding='utf-8' while reading the file, yet it still seems to fail.
Richard MacCutchan 19-Dec-20 5:01am    
It is no good randomly changing things in the hope that the problem will go away. Do some proper debugging and find out why the failure occurs. Only then can you reliably modify the code to correct it.

1 solution

Python
from bs4 import BeautifulSoup
import html5lib
import requests
import time

tw_link = open("TW_Links.txt", "r", encoding="utf-8")
im_link = open("DCDN_Links.txt", "w+")
kak_link = open("KCDN_Links.txt", "w+")

def get_images(urlset):
  for x in urlset:
    rs = requests.Session()
    urls = rs.get(x)
    soup = BeautifulSoup(urls.text, "html5lib")
    content = soup.find("div", {"class": "tt_article_useless_p_margin"})
    images = content.findAll('img')
    for img in images:
      img_url = img['src'] + "?original"
      if "blog" in img_url:
        print(img_url, file=kak_link)
        print(img_url)
      print(img_url, file=im_link)
      print(img_url)
    time.sleep(2)  # small delay between requests

def get_links():
  linklist = []
  for line in tw_link:
    # The fix: drop the trailing newline, otherwise requests sends
    # "...page\n" as the URL and the site serves its 404 page.
    line = line.replace("\n", "")
    linklist.append(line)
  get_images(linklist)

get_links()



For those waiting for a solution, it was pretty simple. I was doubtful of the requests module, so I intercepted the program's traffic through a proxy, and voila: requests was including the EOL character in the request as well. While that might work with most sites, this particular site redirected to its 404 page, so simply removing the "\n" from each line read did the trick.
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
