I am building a web scraper and found myself learning regex, which I previously thought was necessary and vital for pattern searching (for header tags, links, etc.). After some progress, I realise that learning it may have been unnecessary.

While developing, I got as far as writing a program that retrieves all the links (identified by the "href=" attribute) and stores them in a list. This meant that all I needed was a string version of the HTML and my regex to complete the task. However, my project requires me to parse this HTML so that it is easier to manipulate.

Imagine a web scraper that functions like Beautiful Soup, but far more rudimentary, allowing me to search HTML text/strings by tags or their class groups. If that were the case, what need would I have for my regex knowledge? I would like to incorporate it somehow, so if there are any suggestions please let me know.

What I have tried:

The code I use to find patterns of links is this:
Python
import re

txt = ' <!DOCTYPE html><html lang="en-US" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">BBC - Home</title><meta data-rh="true" name="description" content="The best of the BBC, with the latest news and sport headlines, weather, TV & radio highlights and much more from across the whole of BBC Online"/><meta data-rh="true" name="theme-color" content="#FFFFFF"/><link data-rh="true" rel="alternate" hrefLang="en-us" href="https://www.bbc.co.uk/"/><link data-rh="true" rel="alternate" hrefLang="en" href="https://www.bbc.com/"/><link data-rh="true" rel="canonical" href="https://www.bbc.co.uk/"/><link data-rh="true" rel="manifest" href="https://static.files.bbci.co.uk/core/manifest.8d4237cbd18eb052a5fa59995d4624b88fd4c643.json"/><script nonce="+zbYn/hoYrXoqqdHZPlxd8q+bzcm1JSSJ/+qIl3Rm0D271C/Ie"> '


ID = re.findall(r'href="([htps:/]+)', txt)  # Count the links (each starts with "https://")
Count = len(ID)  # 4 links in the sample


Collected = []
while Count != 0:
    match = re.search(r'href="(?P<protocol>[htps/.:]+)(?P<url>[a-zA-Z./]+)', txt)  # Re-matches the next link each cycle
    link = match.group('protocol') + match.group('url')  # Full URL

    Collected.append(link)  # Append the link to the list before it is removed
    txt = txt.replace(link, '', 1)  # Remove the first occurrence and loop (avoids duplicates)

    Count -= 1

print(Collected)
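For what it's worth, if regex is kept for this one job, the search-and-remove loop above can usually be collapsed into a single `re.findall` pass. A minimal sketch (the pattern below assumes the links are double-quoted and start with http/https; the sample fragment is shortened for illustration):

```python
import re

# Shortened sample fragment, same shape as the HTML string above
txt = ('<link rel="alternate" href="https://www.bbc.co.uk/"/>'
       '<link rel="alternate" href="https://www.bbc.com/"/>')

# Capture everything between href=" and the closing quote in one pass;
# findall returns every non-overlapping match, so no removal loop is needed
links = re.findall(r'href="(https?://[^"]+)"', txt)
print(links)  # ['https://www.bbc.co.uk/', 'https://www.bbc.com/']
```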


Any suggestions to improve this would be greatly appreciated, as I am a novice and eager to learn.
Updated 23-Aug-21 8:27am

1 solution

Basically, as I said last time (How web scrape HTML in Python?[^]), you are much, much better off not using regex at all: HTML is inconsistent, impractical and - in essence - a mess.
While it is possible to use a regex to do it, it's a truly horrible regex you end up with, and it will be very difficult to maintain.
As an exercise, just think about br, which can be written two ways, as <br> or <br />, but never as <br>...</br>; then consider how to identify nested paragraphs:
HTML
<p>hello<p>world</p></p>
Then look at "real world" sites and see how many contain malformed HTML with missing close tags ...

Use an HTML parser: you are making your whole project much, much harder than it needs to be!
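To make that concrete without installing anything, Python's standard-library html.parser can walk the tags for you. A minimal sketch (the LinkCollector name and the sample fragment are just for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> and <link> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, with names lowercased
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<link rel="canonical" href="https://www.bbc.co.uk/"/>'
            '<a href="https://www.bbc.com/">BBC</a>')
print(parser.links)  # ['https://www.bbc.co.uk/', 'https://www.bbc.com/']
```

No pattern to maintain: the parser handles attribute order, quoting, and self-closing tags for you, and the same callback approach extends naturally to filtering by class attributes.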
 
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


