I'm trying to scrape data into a CSV file from a website that lists contact information for people in my industry. My code works well until I get to a page where one of the entries doesn't have a specific item.

So for example:

I'm trying to collect

Name, Phone, Profile URL

If there isn't a phone number listed for a specific entry, there isn't even an empty tag for that entry on the page, and my code errors out with

"IndexError: list index out of range"


The code I pasted below works for me across a few websites (with the XPaths/URLs changed, obviously) as long as every field exists on the page with its relevant tag. But if one of the //div[contains(@class, "agent-phone")] tags is missing from one of the listings, it errors out.

Python
from selenium import webdriver
driver = webdriver.Firefox()


MAX_PAGE_NUM = 23
MAX_PAGE_DIG = 2

with open('results.csv', 'w') as f:
    f.write("Name, Number, URL \n")

# Run through the listing pages
for i in range(1, MAX_PAGE_NUM + 1):
    # Zero-pad the page number, e.g. 1 -> "01"
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    website = "https://www.website.com/area/pg-" + page_num
    driver.get(website)

    # Three independent, page-wide lists - one entry per matching tag on the page
    Name = driver.find_elements_by_xpath('//div[contains(@class, "agent-name")]/a')
    Number = driver.find_elements_by_xpath('//div[contains(@class, "agent-phone")]')
    URL = driver.find_elements_by_xpath('//div[contains(@class, "agent-name")]/a')

    num_page_items = len(Name)
    for i in range(num_page_items):
        print(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")

    with open('results.csv', 'a') as f:
        for i in range(num_page_items):
            f.write(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")

driver.close()


What I have tried:

I tried continue when I encountered an IndexError. The problem is, the error doesn't occur until the end of the page, which means the listing without a phone number just picks up the phone number from the next listing, putting the phone number column out of order.

Python
num_page_items = len(Name)
with open('results.csv', 'a') as f:
    for i in range(num_page_items):
        try:
            f.write(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
            print(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
        except IndexError:
            f.write("Nothing, Nothing, Nothing \n")
            print("No element found")
            continue


I'm trying to figure out how to check to see if all elements exist on each entry. If one or more is missing, either skip that entire entry or just put "Empty" for that cell in the CSV. I've tried various things with NoSuchElementException, but I just can't get anything to fire.

I'm fairly new to all this. Thanks in advance for any help.
Updated 29-Aug-19 2:58am
Comments
F-ES Sitecore 28-Aug-19 4:14am    
What you're doing is unethical. If you want data that someone else has spent time, effort, and cost accruing, then ask them for it; if they don't mind giving it away, they will have an API or data dump that you can use.
FakeHelicopterPilot 28-Aug-19 9:44am    
F-ES. I think that's a matter of opinion. It's akin to taking a picture of a phone book. If the information is available, there's nothing wrong with using it. The information I'm scraping is there specifically for business purposes. I'm just making it more digestible for how we need to use it. Not every website is advanced enough to have an API.
MadMyche 28-Aug-19 12:07pm    
I think it is a matter of the website's Copyright and their Terms of Service.
FakeHelicopterPilot 28-Aug-19 12:49pm    
If I plan to sell the data or offer it publicly, that's a matter of copyright. As far as terms of service are concerned, I'm not incredibly worried about that, and that's where opinions come in as far as ethics are concerned.

If public data is readily available on the internet that can be compiled to make my life easier and save me an irreplaceable commodity, my time, I'm going to take advantage of that opportunity.

If I were Rain Man, I could just click through manually and remember each entry. Python is my personal Rain Man.

1 solution

Use an extra comma, or insert a "placeholder" value for the fields that are missing. What else?
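
A minimal sketch of that idea, assuming each listing sits inside its own container div (the "agent-card" class below is a placeholder; substitute whatever element actually wraps each entry on the target page): loop over the containers, use relative XPaths scoped to each one, and write "Empty" when a tag is missing. Note that find_elements_by_xpath (plural) returns an empty list for a missing tag instead of raising NoSuchElementException, which is why that exception never fired in the attempts above.

Python
from selenium import webdriver

driver = webdriver.Firefox()

MAX_PAGE_NUM = 23
MAX_PAGE_DIG = 2

with open('results.csv', 'w') as f:
    f.write("Name, Number, URL \n")

for page in range(1, MAX_PAGE_NUM + 1):
    page_num = str(page).zfill(MAX_PAGE_DIG)   # zero-pad, e.g. 1 -> "01"
    driver.get("https://www.website.com/area/pg-" + page_num)

    # One element per listing; "agent-card" is a guess - use the real wrapper class here.
    listings = driver.find_elements_by_xpath('//div[contains(@class, "agent-card")]')

    with open('results.csv', 'a') as f:
        for listing in listings:
            # './/' restricts each search to this listing only
            name_links = listing.find_elements_by_xpath('.//div[contains(@class, "agent-name")]/a')
            phones = listing.find_elements_by_xpath('.//div[contains(@class, "agent-phone")]')

            # An empty list means the tag is absent, so fall back to a placeholder
            name = name_links[0].text.replace(",", ".") if name_links else "Empty"
            url = name_links[0].get_attribute('href') if name_links else "Empty"
            phone = phones[0].text if phones else "Empty"

            row = name + "," + phone + "," + url
            f.write(row + "\n")
            print(row)

driver.close()

Because every field is looked up inside its own listing rather than in a single page-wide list, the rows stay aligned and a missing phone number can never borrow a value from the next entry.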
 