Below is the response I got by sending a GET request (GET /k/302.html HTTP/1.0) to a server over a Java socket connection.
HTTP/1.1 200 OK
Date: Thu, 25 Apr 2019 06:31:21 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Thu, 11 Apr 2019 11:44:58 GMT
ETag: "59-5863fb73cdcbb"
Accept-Ranges: bytes
Content-Length: 89
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
<html>
<body>
<a href="/"> More pages </a>
<img src="redback.jpg">
</body>
</html>
Connection closed by foreign host.
I have to write simple Java code that crawls all the URLs present on this web page (/k/302.html).
Currently I can extract the first URL ("/") using this Java regular expression:
<pre lang="java">"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"</pre>
But I am not able to get the second URL, which is in the <img> tag.
Below is the expanded HTML that I got from the console, which clearly shows that "redback.jpg" is a hyperlink:
<span class="html-tag"><img <span class="html-attribute-name">src</span>="<a class="html-attribute-value html-resource-link" target="_blank" href="redback.jpg" rel="noreferrer noopener">redback.jpg</a>"></span>
But the GET response itself does not clearly indicate that it is a hyperlink. How can I extract such URLs from the response alone? I have to do this in plain Java, using a socket connection and a standard HTTP request, without any external libraries.
What I have tried:
For the simple URL I tried this Java regex:
<pre lang="java">"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"</pre>
But I don't know how to handle URLs embedded in other tags, because the HTTP GET response does not mark them as hyperlinks.
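One direction that might work, shown only as a rough sketch: instead of matching only `<a href=...>`, match an `href` or `src` attribute inside any tag, since the `<img>` URL sits in `src`. The class name `LinkExtractor` and the method `extractLinks` are hypothetical names chosen for illustration, and the sketch assumes the response body has already been read into a string. It uses only `java.util.regex` from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class LinkExtractor {
    // Matches an href or src attribute inside any opening tag. The value may be
    // double-quoted (group 1), single-quoted (group 2), or unquoted (group 3).
    private static final Pattern LINK = Pattern.compile(
        "<\\w+[^>]*?\\s(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the raw attribute values in document order (not resolved to absolute URLs).
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Exactly one of the three alternatives matched; take the non-null group.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    links.add(m.group(g));
                    break;
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Body of the response from the question, assumed to be already
        // separated from the HTTP headers.
        String body = "<html>\n<body>\n<a href=\"/\"> More pages </a>\n"
                    + "<img src=\"redback.jpg\">\n</body>\n</html>";
        for (String link : extractLinks(body)) {
            System.out.println(link);
        }
    }
}
```

On the body shown in the question this should give "/" and "redback.jpg". It only picks up the first `href` or `src` in each tag and does not handle comments or scripts, so it is a starting point rather than a full HTML parser.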