Hello! I am trying to create a simple Python web-scraping script to extract the content that sits outside of span tags. The HTML I am working with is very simple: a single body tag containing multiple empty span tags, each followed by a line of text.

HTML
<body>
<span id="line2"></span>NUM=2039
<span id="line3"></span>NAME0=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-27-02_60.mov
<span id="line4"></span>SIZE0=15747369
<span id="line5"></span>NAME1=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-26-01_60.mov
<span id="line6"></span>SIZE1=15725278


I want to extract the file names, such as MP_2018-04-24_10-26-01_60.mov, as text. Any feedback is welcome!

What I have tried:

Python
from lxml import html
import requests

page = requests.get('http://192.168.1.99/form/getStorageFileList')
tree = html.fromstring(page.content)
clips = tree.xpath('//span[@="NAME"]/text()')
print(clips)


My output is [] and nothing else. I assume this is because the span tags themselves contain no text. Should I try to extract the body tag instead? Thanks in advance for any ideas or feedback!
Comments
Richard Deeming 25-Apr-18 14:20pm    
Is that the actual HTML you're retrieving, or a typo in your question? None of the text is inside a <span>.
Member 13424161 25-Apr-18 15:27pm    
Sorry for not being clear. It's not a typo. The HTML I'm trying to retrieve consists only of a body tag and span tags, and the .mov file names I want are located outside of these span tags. When I retrieve the contents of the body tag, I get all of the HTML, which is not what I'm looking for. I'm looking for some insight into how to retrieve only the file names, such as MP_2018-04-24_10-26-01_60.mov, from the HTML.
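As a rough sketch of the filtering step described above: once the body text is available with the tags stripped (in lxml, `text_content()` on the parsed element is one way to get it), the listing is just KEY=VALUE lines, so the file names can be picked out with plain string handling. The function name `clip_names` and the `sample` string are made up for illustration, and the sketch assumes every clip line has a key starting with NAME and a value ending in .mov, as in the HTML shown in the question.

```python
import posixpath


def clip_names(listing_text):
    """Return the .mov file names from KEY=VALUE lines whose key starts with NAME."""
    names = []
    for line in listing_text.splitlines():
        # partition() splits on the first "=" only; sep is "" when there is none
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("NAME") and value.endswith(".mov"):
            # Keep only the last path component, e.g. MP_..._60.mov
            names.append(posixpath.basename(value))
    return names


# Sample text copied from the question, with the span tags already removed
sample = """NUM=2039
NAME0=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-27-02_60.mov
SIZE0=15747369
NAME1=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-26-01_60.mov
SIZE1=15725278"""

print(clip_names(sample))
```

The NUM and SIZE lines are skipped because their keys do not start with NAME, so only the two clip names remain.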
Richard Deeming 25-Apr-18 16:28pm    
That's a pain. Assuming your HTML parser supports it, you'll need to get the next sibling node following the matching span.

Something like this might work:
clips = tree.xpath('//span[@id="line2"]/following-sibling::text()[1]')
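A minimal sketch of the sibling-node idea applied to every span at once, assuming the page looks like the HTML in the question. The text after each span is a text node, so the XPath below asks for `text()` on the `following-sibling` axis. The `SAMPLE` string and the function name `texts_after_spans` are made up for illustration; a real script would pass `page.content` from the request instead.

```python
from lxml import html

# Sample markup copied from the question (closing body tag added for completeness)
SAMPLE = """<body>
<span id="line2"></span>NUM=2039
<span id="line3"></span>NAME0=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-27-02_60.mov
<span id="line4"></span>SIZE0=15747369
<span id="line5"></span>NAME1=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-26-01_60.mov
<span id="line6"></span>SIZE1=15725278
</body>"""


def texts_after_spans(markup):
    """Return the stripped text node that directly follows each lineN span."""
    tree = html.fromstring(markup)
    # For each matching span, take the first text node among its following siblings
    found = tree.xpath('//span[starts-with(@id, "line")]/following-sibling::text()[1]')
    return [text.strip() for text in found if text.strip()]


entries = texts_after_spans(SAMPLE)
# Keep only the NAME entries and cut each path down to its file name
clips = [entry.split("/")[-1] for entry in entries if entry.startswith("NAME")]
print(clips)
```

In lxml the same text is also available as each span element's `.tail` attribute, which avoids the XPath axis entirely if that reads more clearly.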
Member 13424161 1-May-18 9:53am    
Sorry for the late reply! I had to speak with the manufacturer about how to get past their authentication, because I kept getting a 401 error with every script I tried. Now I'm going to try what you suggested, thanks!

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)