Question

I wrote a simple Java Web Crawler that lets the user type in any web page and it will search through the page and pull out the links as Strings. I am not using a package like Jsoup. My question is, how do I only print the absolute URLs rather than both relative and absolute URLs?


Solution

Inspect the src or href attribute to see whether it's absolute, relative, or protocol-relative (//stackoverflow.com/file). Parse the page's URL. If the attribute is protocol-relative, take the protocol from the parsed page URL, then append the content of the attribute. If it's relative, strip the query string and fragment, if any, from the original URL and "append" the relative portion: a value starting with / replaces the whole path, while anything else replaces only the last path segment. Be aware that a relative URL can look like /foo, foo, foo/bar, or ./../../bar/../foo, so you will want to resolve path traversals before printing.
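As a rough sketch of the steps above, the standard java.net.URI class can do the classification and resolution without a third-party parser. The class name LinkResolver, the helper names classify and toAbsolute, and the example.com page URL are made up for illustration, and the code assumes the attribute values have already been extracted and are syntactically valid URIs.

```java
import java.net.URI;

class LinkResolver {

    /** Classifies a raw href/src value as absolute, relative, or protocol-relative. */
    static String classify(String link) {
        if (link.startsWith("//")) {
            return "protocol-relative";
        }
        try {
            // isAbsolute() is true only when the value carries its own scheme.
            return URI.create(link).isAbsolute() ? "absolute" : "relative";
        } catch (IllegalArgumentException e) {
            return "invalid";
        }
    }

    /** Resolves a link against the URL of the page it was found on. */
    static String toAbsolute(String pageUrl, String link) {
        URI base = URI.create(pageUrl);
        // resolve() drops the page's query and fragment and merges the paths;
        // normalize() collapses any remaining "." and ".." segments.
        return base.resolve(URI.create(link)).normalize().toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL, used only for this example.
        String page = "http://example.com/a/b/index.html?x=1#top";
        String[] links = {
            "/foo", "foo", "foo/bar", "./../../bar/../foo",
            "//stackoverflow.com/file", "http://other.org/x"
        };
        for (String link : links) {
            System.out.println(classify(link) + "\t" + toAbsolute(page, link));
        }
    }
}
```

With the page URL above, foo resolves to http://example.com/a/b/foo, /foo and ./../../bar/../foo both resolve to http://example.com/foo, and //stackoverflow.com/file picks up the page's http scheme. To print only links that were already absolute in the page, filter on classify(link) instead of resolving.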

Edit:

Take a look at java.net.URL and the Commons URL Builder. They'll both be helpful.

Licensed under: CC-BY-SA with attribution