Print only absolute URLs

https://stackoverflow.com/questions/22242927

10-06-2023
Question

I wrote a simple Java Web Crawler that lets the user type in any web page and it will search through the page and pull out the links as Strings. I am not using a package like Jsoup. My question is, how do I only print the absolute URLs rather than both relative and absolute URLs?

Solution

Inspect the src or href attribute to see whether it's absolute, relative, or protocol-relative (//stackoverflow.com/file). Parse the page's URL. If the attribute is protocol-relative, take the protocol from the parsed page URL and append the content of the attribute. If it's relative, strip the query string and fragment identifier from the page URL, then resolve the relative portion against what remains: a value starting with / replaces the whole path, anything else replaces the last path segment. Be aware that a relative URL can look like /foo, foo, foo/bar, or ./../../bar/../foo, so you might want to resolve path traversals before printing.
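The steps above can be sketched with java.net.URI from the standard library, whose resolve method handles absolute, protocol-relative, and relative references and collapses . and .. segments. This is only a rough sketch, not the one right way to do it: the class name LinkResolver, the method toAbsolute, and the example.com addresses are made up for illustration, and it assumes the links have already been extracted as Strings.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical helper, not part of any library.
class LinkResolver {

    /**
     * Resolves one href/src value against the URL of the page it came from.
     * Returns an empty Optional when either string is not a valid URI.
     */
    static Optional<String> toAbsolute(String pageUrl, String link) {
        try {
            URI base = new URI(pageUrl);
            URI ref = new URI(link.trim());
            // resolve() returns ref unchanged if it is already absolute,
            // borrows the scheme for //host/path references, and otherwise
            // merges the paths, dropping the base's query and fragment.
            URI resolved = base.resolve(ref).normalize();
            return resolved.isAbsolute()
                    ? Optional.of(resolved.toString())
                    : Optional.empty();
        } catch (URISyntaxException e) {
            // e.g. a link containing an unescaped space
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String page = "https://example.com/a/b/c/page.html?x=1#top";
        String[] links = {
            "https://other.org/x",
            "//cdn.example.com/lib.js",
            "/foo",
            "foo/bar",
            "./../../bar/../foo",
            "bad link"
        };
        for (String link : links) {
            // Print only the links that could be turned into absolute URLs.
            toAbsolute(page, link).ifPresent(System.out::println);
        }
    }
}
```

One caveat: a reference that climbs above the root with more .. segments than the page path has may keep a leading /.. in the result, so check for that if your pages link that way.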

Edit:

Take a look at URL and the Commons URL Builder. They'll both be helpful.
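If you'd rather skip relative links entirely than resolve them, something like the following might be enough. It is a sketch under assumptions: AbsoluteFilter and isAbsoluteUrl are hypothetical names, and it goes through java.net.URI.toURL() rather than the URL constructors, so it is one possible way to use the URL class, not necessarily the way meant above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of any library.
class AbsoluteFilter {

    /** Returns true only if the link parses as an absolute URL. */
    static boolean isAbsoluteUrl(String link) {
        try {
            // toURL() throws IllegalArgumentException when the URI has no
            // scheme (relative and protocol-relative links) and
            // MalformedURLException when the JDK has no handler for the scheme.
            new URI(link.trim()).toURL();
            return true;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] links = { "https://example.com/a", "/foo", "foo/bar", "//cdn.example.com/x" };
        for (String link : links) {
            if (isAbsoluteUrl(link)) {
                System.out.println(link);
            }
        }
    }
}
```

Note that protocol-relative links are treated as not absolute here, so they are dropped along with the relative ones.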

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow