Why I get null returned when I do web scraping with xpath

Question

1.00/5 (1 vote)

See more:

Hello, I am Caixia, trying to grab some housing transaction information from the webpage, i did it with xpath, some information is scrapped successfully, but some is not. I can confirm that the xpath is correct. My programme is following:

from bs4 import BeautifulSoup as soup
import re
import requests
from parsel import selector
from lxml import etree
import pandas as pd
import openpyxl
import time

# Get Chrome browser header information
# 在http请求中设置头部信息，以免被封ip;
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}

finalurl = 'https://sz.lianjia.com/xiaoqu/luohuqu/pg1cro11/'
print(finalurl)
get_finalurl = requests.get(finalurl, headers = headers).content
finalurl_soup = soup(get_finalurl, 'html.parser')
finalurl_ul = finalurl_soup.find('ul', attrs={'class': 'listContent'})
findurl_all = finalurl_ul.find_all('a', attrs={'class': 'img'})

for all in findurl_all:
re_url = requests.get(all['href'])
re_html = re_url.text
re_html_e = etree.HTML(re_html)
info = { }

#xiaoqu base information
info['link'] = all['href']
info['xiaoqu_name'] = re_html_e.xpath('/html/body/div[4]/div/div[1]/h1/text()')[0]
info['xiaoqu_address'] = re_html_e.xpath('/html/body/div[4]/div/div[1]/div/text()')[0]
info['xiaoqu_city'] = re_html_e.xpath('/html/body/div[5]/div[1]/a[2]/text()')[0]
info['xiqoqu_chengqu'] = re_html_e.xpath('/html/body/div[5]/div[1]/a[3]/text()')[0]
info['xiaoqu_jiedao'] = re_html_e.xpath('/html/body/div[5]/div[1]/a[4]/text()')[0]
info['xiaoqu_price'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[1]/div/span[1]/text()')[0]
info['xiaoqu_age'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[1]/span[2]/text()')[0]
info['xiaoqu_buildtype'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[2]/span[2]/text()')[0]
info['xiaoqu_wuyefei'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[3]/span[2]/text()')[0]
info['xiaoqu_wuyecompany'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[4]/span[2]/text()')[0]
info['xiaoqu_developer'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[5]/span[2]/text()')[0]
info['xiaoqu_building_number'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[6]/span[2]/text()')[0]
info['xiaoqu_house_number'] = re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[7]/span[2]/text()')[0]

#xiaoqu around information

info['xiaoqu_jiaotong_subway1_distance'] = re_html_e.xpath('//*[@id="mapListContainer"]/ul/li[1]/div/div[1]/span[4]/text()')
info['xiaoqu_jiaotong_subway2_distance'] = re_html_e.xpath('//*[@id="mapListContainer"]/ul/li[2]/div/div[1]/span[4]/text()')
info['xiaoqu_jiaotong_subway3_distance'] = re_html_e.xpath('//*[@id="mapListContainer"]/ul/li[3]/div/div[1]/span[4]/text()')

print(info)

The last three variables is null. Why?

What I have tried:

i have tried to scrape some basic community information variables and i got it. But the last three variables are null.

Posted 30-Sep-18 4:18am

Member 14002680

Add a Solution

Comments

Richard MacCutchan 30-Sep-18 10:48am

re_html_e.xpath('/html/body/div[6]/div[2]/div[2]/div[7]/span[2]/text()')[0]
With all those hard coded index values it could be any one of them that is incorrect. You must look at each set from the beginning and check that the elements you are looking for actually exist. Do not assume that the data will be in the format you want.

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)