Python webscraping, extracting content outside of span tags

Question

Hello! I am trying to write a simple Python web-scraping script to extract the content outside of span tags. The HTML I am working with is very simple: a single body tag containing multiple span tags.

HTML

<body>
<span id="line2"></span>NUM=2039
<span id="line3"></span>NAME0=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-27-02_60.mov
<span id="line4"></span>SIZE0=15747369
<span id="line5"></span>NAME1=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-26-01_60.mov
<span id="line6"></span>SIZE1=15725278

I want to extract the file names, such as MP_2018-04-24_10-26-01_60.mov, as text. Any feedback is welcome!

What I have tried:

Python

from lxml import html
import requests

page = requests.get('http://192.168.1.99/form/getStorageFileList')
tree = html.fromstring(page.content)
clips = tree.xpath('//span[@="NAME"]/text()')
print(clips)

My output is [] and nothing else. I assume this is because the span tags themselves contain nothing but whitespace. Should I try to extract the body tag instead? Thanks in advance for any ideas or feedback!

Posted 24-Apr-18 7:25am

Member 13424161

Comments

Richard Deeming 25-Apr-18 14:20pm

Is that the actual HTML you're retrieving, or a typo in your question? None of the text is inside a <span>.

Member 13424161 25-Apr-18 15:27pm

Sorry for not being clear. It's not a typo. The HTML I'm trying to retrieve consists only of a body tag and span tags, and the .mov file names I want are located outside of the span tags. When I retrieve the contents of the body tag, I get all of the HTML, which is not what I'm looking for. I'm looking for some insight into how to retrieve only the file names, such as MP_2018-04-24_10-26-01_60.mov, from the HTML.

Richard Deeming 25-Apr-18 16:28pm

That's a pain. Assuming your HTML parser supports it, you'll need to get the next sibling node following the matching span.

Something like this might work:

clips = tree.xpath('//span[@id="line2"]/following-sibling::text()[1]')
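
To illustrate the sibling-node idea against the markup posted in the question, here is a minimal sketch. It assumes the page really looks like the sample (empty spans, each followed by a KEY=VALUE line); the function name extract_clip_names and the SAMPLE constant are made up for the example. In XPath, text() selects text nodes, whereas * selects only elements, so the text after an empty span has to be reached through text().

```python
from lxml import html

# Sample markup copied from the question; a real script would pass
# page.content from requests.get(...) instead.
SAMPLE = """<body>
<span id="line2"></span>NUM=2039
<span id="line3"></span>NAME0=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-27-02_60.mov
<span id="line4"></span>SIZE0=15747369
<span id="line5"></span>NAME1=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-26-01_60.mov
<span id="line6"></span>SIZE1=15725278
</body>"""


def extract_clip_names(markup):
    """Return the .mov file names found in the NAME<n>= entries."""
    tree = html.fromstring(markup)
    # For every span with an id, take the first text node that follows it.
    texts = tree.xpath('//span[@id]/following-sibling::text()[1]')
    clips = []
    for text in texts:
        key, _, value = text.strip().partition('=')
        # Keep only NAME0, NAME1, ... and drop the directory part of the path.
        if key.startswith('NAME') and value:
            clips.append(value.rsplit('/', 1)[-1])
    return clips


clips = extract_clip_names(SAMPLE)
print(clips)
```

Filtering on the NAME prefix in Python rather than on the span id avoids having to know which line numbers hold file names, since those ids presumably change as the list grows.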

Member 13424161 1-May-18 9:53am

Sorry for the late reply! I had to speak with the manufacturer about how to get past their authentication, because I kept getting a 401 error with every script I tried. Now I'm going to try what you suggested, thanks!
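
If the XPath route proves awkward, another option is to skip HTML parsing and pattern-match the response text directly. The sketch below uses only the standard library and assumes every file entry has the form NAME<number>=<path> with no spaces in the path, as in the sample from the question; the names NAME_ENTRY and clip_names_from_text are hypothetical.

```python
import re

# Text in the shape shown in the question; a real script would use
# page.text from the requests response instead.
RESPONSE_TEXT = """<body>
<span id="line2"></span>NUM=2039
<span id="line3"></span>NAME0=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-27-02_60.mov
<span id="line4"></span>SIZE0=15747369
<span id="line5"></span>NAME1=Record_Continuously/2018-04-24/71/MP_2018-04-24_10-26-01_60.mov
<span id="line6"></span>SIZE1=15725278
"""

# NAME followed by digits and '=', then capture everything up to whitespace.
NAME_ENTRY = re.compile(r'\bNAME\d+=(\S+)')


def clip_names_from_text(text):
    """Return the file-name part of every NAME<n>=<path> entry in text."""
    return [path.rsplit('/', 1)[-1] for path in NAME_ENTRY.findall(text)]


names = clip_names_from_text(RESPONSE_TEXT)
print(names)
```

This is more brittle than a parser if the device ever changes its output format, but it does not depend on how the spans are nested.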

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)