I received the response below after sending a GET request (GET /k/302.html HTTP/1.0) to a server over a Java socket connection.

HTTP/1.1 200 OK
Date: Thu, 25 Apr 2019 06:31:21 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Thu, 11 Apr 2019 11:44:58 GMT
ETag: "59-5863fb73cdcbb"
Accept-Ranges: bytes
Content-Length: 89
Vary: Accept-Encoding
Connection: close
Content-Type: text/html

<html>
	<body>
		<a href="/"> More pages </a>
		<img src="redback.jpg">
	</body>
</html>
Connection closed by foreign host.
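
A minimal sketch of how a response like the one above could be fetched with a plain socket, using only the JDK. The class name RawHttpGet, its helper methods, and the host example.com are placeholders for illustration; this is not the code that actually produced the response shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawHttpGet {
    // Builds a minimal HTTP/1.0 request. The Host header is optional in
    // HTTP/1.0, but servers with virtual hosts usually need it.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Returns everything after the blank line that ends the headers,
    // or an empty string if no such blank line is found.
    static String bodyOf(String response) {
        int i = response.indexOf("\r\n\r\n");
        return i < 0 ? "" : response.substring(i + 4);
    }

    // Sends the request and reads until the server closes the connection.
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        // "example.com" is a placeholder host, not the real server.
        String response = fetch("example.com", 80, "/k/302.html");
        System.out.println(bodyOf(response));
    }
}
```

Reading until end of stream works here because an HTTP/1.0 request without keep-alive makes the server close the connection after the body.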

I have to write simple Java code that crawls all the URLs present on this page (/k/302.html).
Currently I can extract the first URL ("/") using this Java regular expression:
"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"

But I am not able to get the second URL, which is in the <img> tag.

Below is the expanded HTML that I got from the browser console, where "redback.jpg" is clearly shown as a hyperlink.

HTML
<span class="html-tag"><img <span class="html-attribute-name">src</span>="<a class="html-attribute-value html-resource-link" target="_blank" href="redback.jpg" rel="noreferrer noopener">redback.jpg</a>"></span>


But the raw GET response does not clearly mark it as a hyperlink. How can I extract such URLs from the response alone? I have to do this in plain Java, using a socket connection with a standard HTTP request and no external libraries.



What I have tried:

For the simple URL I used this Java regex:
"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"
but I do not see how to get the URLs embedded in other tags, because the HTTP GET response does not mark them as href links.
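
The console view only decorates the value of the src attribute as a link; in the raw response the URL is simply the value of src on the img tag. One possible direction, as a rough sketch rather than a complete HTML parser, is to match the attribute itself instead of the surrounding tag. The class name LinkExtractor and the method extractUrls are made up for this example, and the choice to look only at href and src is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches href=... or src=... with a double-quoted, single-quoted,
    // or unquoted value. It does not check which tag the attribute is in.
    private static final Pattern URL_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTR.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is set, depending on quoting.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g));
                    break;
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html>\n\t<body>\n"
                    + "\t\t<a href=\"/\"> More pages </a>\n"
                    + "\t\t<img src=\"redback.jpg\">\n"
                    + "\t</body>\n</html>";
        System.out.println(extractUrls(html)); // prints [/, redback.jpg]
    }
}
```

Because the pattern ignores tag names, it can also match text such as data-href="..." or the literal string href= inside page content, so the results may need filtering.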
Posted
Updated 24-Apr-19 20:54pm

1 solution

You can search for href= to get relative URLs. Basically, if a regex doesn't work, work out the cases where it fails and handle them with plain string manipulation.
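
A rough sketch of that string-searching idea, assuming the attribute is written exactly as name= with no spaces around the equals sign and in the same letter case as the search term. The class name AttributeScanner and the method valuesOf are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

class AttributeScanner {
    // Collects the value of every occurrence of attr + "=" in html.
    // Handles double-quoted, single-quoted, and unquoted values.
    static List<String> valuesOf(String html, String attr) {
        List<String> values = new ArrayList<>();
        String needle = attr + "=";
        int from = 0;
        while (true) {
            int at = html.indexOf(needle, from);
            if (at < 0) break;
            int start = at + needle.length();
            if (start >= html.length()) break;

            char first = html.charAt(start);
            int end;
            if (first == '"' || first == '\'') {
                start++;                          // skip the opening quote
                end = html.indexOf(first, start); // find the matching quote
                if (end < 0) break;               // unterminated value
            } else {
                end = start;                      // unquoted: stop at space or '>'
                while (end < html.length()
                        && !Character.isWhitespace(html.charAt(end))
                        && html.charAt(end) != '>') {
                    end++;
                }
            }
            if (end > start) {
                values.add(html.substring(start, end));
            }
            from = end;
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/\"> More pages </a>\n<img src=\"redback.jpg\">";
        System.out.println(valuesOf(html, "href")); // prints [/]
        System.out.println(valuesOf(html, "src"));  // prints [redback.jpg]
    }
}
```

Calling it once per attribute name (href, src, and any others of interest) covers the cases the single-tag regex misses.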
 
Comments
SGAU 25-Apr-19 4:28am    
I already did that, and it fetches the first link but not the second one inside the img tag.
Christian Graus 25-Apr-19 18:04pm    
Read my answer again. If you identify the situations where the regex does not work (I'd say that's all relative links), you can find them ALL by searching for tag names.
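
Since the links on this page are relative, a crawler would also need to turn them into absolute URLs before requesting them. A small sketch using java.net.URI from the JDK, so no external library is involved; the class name LinkResolver and the base address http://example.com are placeholders, as the real host is not given here.

```java
import java.net.URI;

class LinkResolver {
    // Resolves a link found on a page against that page's own URL.
    // Throws IllegalArgumentException if either string is not a valid URI
    // (for example, a link containing unescaped spaces).
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/k/302.html"; // placeholder host
        System.out.println(resolve(page, "/"));           // prints http://example.com/
        System.out.println(resolve(page, "redback.jpg")); // prints http://example.com/k/redback.jpg
    }
}
```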

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


