Question

I am doing some web crawling and ran into the question of when to add a trailing slash to a URL. Some sites use one and some don't, and entering the wrong form in a browser simply redirects you to the right one. Normalization would add the slash at the end, but that causes a problem when converting relative URLs to absolute ones.

For example, suppose a user selects the absolute URL http://stack.com/more, but the actual (redirected) URL is http://stack.com/more/ and a relative URL on the page is index.html.

Then doing URL newurl = new URL(url, relativeURL);

yields http://stack.com/index.html (nonexistent page)

when it should actually be http://stack.com/more/index.html (real page).

Does anyone know a good way to correctly append the slash at the end?


Solution

If a relative URL starts with a /, it's only relative to the root (the domain). So both

http://stack.com/more/ + /index.html

and

http://stack.com/more + /index.html

are correctly resolved to

http://stack.com/index.html

not

http://stack.com/more/index.html

With a leading slash on the relative URL, it makes no difference whatsoever whether there's a / at the end of more.

The trick comes in when there's no leading slash on the relative URL, e.g. index.html. When resolving those, you're supposed to remove the last segment of the base path (everything after the final /) and replace it with the relative path. The trailing slash makes a difference in that case, because

http://stack.com/more/ + index.html

resolves to

http://stack.com/more/index.html

but

http://stack.com/more + index.html

resolves to

http://stack.com/index.html

(index.html replaces more, because more is the final segment).
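
The four cases above can be checked with a minimal Java sketch. It uses java.net.URI.resolve rather than the new URL(url, relativeURL) constructor from the question; both should apply the same resolution rule for these inputs. The class name ResolveDemo and the helper resolve are hypothetical names chosen for illustration.

```java
import java.net.URI;

class ResolveDemo {
    // Hypothetical helper: resolve a relative reference against a base URL
    // and return the absolute result as a string.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String[] bases = {"http://stack.com/more/", "http://stack.com/more"};
        String[] relatives = {"/index.html", "index.html"};

        // Print every base/relative combination discussed in the answer.
        for (String relative : relatives) {
            for (String base : bases) {
                System.out.println(base + " + " + relative + " -> " + resolve(base, relative));
            }
        }
    }
}
```

Only the combination of a base without a trailing slash and a relative URL without a leading slash drops the more segment. For a crawler, this suggests passing the URL the page was actually served from (after any redirect) as the base, rather than guessing whether to append a slash.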

Licensed under: CC-BY-SA with attribution