I am building a web scraper and found myself learning regex, which I previously thought was necessary and vital for pattern searching (for header tags, links, etc.). After some progress, I realise that learning it may have been unnecessary.

While developing, I got as far as writing a program that retrieves all the links (identified by the "href=" attribute) and stores them in a list. This meant that all I needed was a string version of the HTML and my regex to complete the task. However, my project requires me to parse this HTML so that it is easier to manipulate.

Imagine a web scraper that functions like Beautiful Soup, but far more rudimentary, allowing me to search HTML text/strings by tags or their class groups. If that were the case, what need would I have for my regex knowledge? I would like to incorporate it somehow, so if there are any suggestions please let me know.

What I have tried:

The code I use to find patterns of links is this:
Python
import re

txt = ' <!DOCTYPE html><html lang="en-US" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">BBC - Home</title><meta data-rh="true" name="description" content="The best of the BBC, with the latest news and sport headlines, weather, TV & radio highlights and much more from across the whole of BBC Online"/><meta data-rh="true" name="theme-color" content="#FFFFFF"/><link data-rh="true" rel="alternate" hrefLang="en-us" href="https://www.bbc.co.uk/"/><link data-rh="true" rel="alternate" hrefLang="en" href="https://www.bbc.com/"/><link data-rh="true" rel="canonical" href="https://www.bbc.co.uk/"/><link data-rh="true" rel="manifest" href="https://static.files.bbci.co.uk/core/manifest.8d4237cbd18eb052a5fa59995d4624b88fd4c643.json"/><script nonce="+zbYn/hoYrXoqqdHZPlxd8q+bzcm1JSSJ/+qIl3Rm0D271C/Ie"> '


ID = re.findall(r'href="([htps:/]+)', txt)  # Count the links (each starts with "https://")
Count = len(ID)  # 4 links in the sample


Collected = []
while Count != 0:
    match = re.search(r'href="(?P<protocol>[htps/.:]+)(?P<url>[a-zA-Z./]+)', txt)  # Re-matches the next link each cycle
    link = match.group('protocol') + match.group('url')  # Full URL

    Collected.append(link)  # Append the link to the list before it is removed
    txt = txt.replace(link, '', 1)  # Remove the first occurrence and loop (avoids duplicates)

    Count -= 1

print(Collected)
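For what it's worth, if regex is kept for this one job, the search-and-remove loop above can usually be collapsed into a single `re.findall` pass. A minimal sketch (the pattern below assumes the links are double-quoted and start with http/https; the sample fragment is shortened for illustration):

```python
import re

# Shortened sample fragment, same shape as the HTML string above
txt = ('<link rel="alternate" href="https://www.bbc.co.uk/"/>'
       '<link rel="alternate" href="https://www.bbc.com/"/>')

# Capture everything between href=" and the closing quote in one pass;
# findall returns every non-overlapping match, so no removal loop is needed
links = re.findall(r'href="(https?://[^"]+)"', txt)
print(links)  # ['https://www.bbc.co.uk/', 'https://www.bbc.com/']
```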


Any suggestions to improve this would be greatly appreciated, as I am a novice and eager to learn.
Updated 23-Aug-21 8:27am

1 solution

Basically, as I said last time (How web scrape HTML in Python?[^]), you are much, much better off not using regex at all: HTML is inconsistent, impractical and - in essence - a mess.
While it is possible to use a regex to do it, it's a truly horrible regex you end up with, and it will be very difficult to maintain.
As an exercise, just think about br, which can be written two ways, as <br> or <br />, but never as <br>...</br>; then consider how to identify nested paragraphs:
HTML
<p>hello<p>world</p></p>
Then look at "real world" sites and see how many contain malformed HTML with missing close tags ...

Use an HTML parser: you are making your whole project much, much harder than it needs to be!
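To make that concrete without installing anything, Python's standard-library html.parser can walk the tags for you. A minimal sketch (the LinkCollector name and the sample fragment are just for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> and <link> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, with names lowercased
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<link rel="canonical" href="https://www.bbc.co.uk/"/>'
            '<a href="https://www.bbc.com/">BBC</a>')
print(parser.links)  # ['https://www.bbc.co.uk/', 'https://www.bbc.com/']
```

No pattern to maintain: the parser handles attribute order, quoting, and self-closing tags for you, and the same callback approach extends naturally to filtering by class attributes.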
 
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


