Basically, as I said last time ("How web scrape HTML in Python?"), you are much, much better off not using regex at all: HTML is inconsistent, impractical and - in essence - a mess.
While it is possible to use a regex to do it, it's a truly horrible regex you end up with, and it will be very difficult to maintain.
For an exercise, just think about br, which can be represented two ways, as <br> or <br />, but never as <br>...</br>; and how to identify nested paragraphs such as <p>hello<p>world</p></p>.
Then look at "real world" sites and see how many contain malformed HTML with missing close tags ...
Use an HTML parser: you are making your whole project much, much harder than it needs to be!
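
To give a rough idea of what that looks like, here is a minimal sketch using only html.parser from the Python standard library. The class name TagCollector, the helper collect and the sample markup are made up for illustration. Note that html.parser only reports tags and text as it meets them; it does not build a tree or repair nesting, so for real scraping a tree-building library such as Beautiful Soup or lxml is probably the more comfortable choice.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and non-blank text chunks."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called for <br>. For <br /> the parser calls handle_startendtag,
        # whose default implementation calls handle_starttag and then
        # handle_endtag, so both spellings end up here.
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Keep only text that is not pure whitespace.
        if data.strip():
            self.text.append(data.strip())


def collect(html):
    """Feed a string of HTML to the parser and return (start_tags, text)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.start_tags, parser.text


# Sample markup (an assumption): both spellings of the line break,
# plus a first <p> that is never closed.
tags, text = collect("<p>hello<br>big<br />wide<p>world</p>")
print(tags)  # ['p', 'br', 'br', 'p']
print(text)  # ['hello', 'big', 'wide', 'world']
```

The parser treats <br> and <br /> the same way and does not choke on the missing close tag, with no pattern to maintain for either case.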