Parsing links with JTidy

https://stackoverflow.com/questions/8568976

21-03-2021

Question

I am currently using JTidy to parse an HTML document and fetch a collection of all anchor tags in the given HTML document. I then extract the value of each tag's href attribute to come up with a collection of links on the page.

Unfortunately, these links can be expressed in a few different ways: some absolute (http://www.example.com/page.html), some relative (/page.html, page.html, or ../page.html). Some are just anchors (#paragraphA). When I visit my page in a browser, it knows automatically how to handle these different href values when I click a link. However, to follow one of the links retrieved from JTidy programmatically with an HTTPClient, I first need to provide a valid URL (e.g. I would first need to transform /page.html, page.html, and http://www.example.com/page.html to http://www.example.com/page.html).

Is there some built-in functionality, whether in JTidy or elsewhere, that can achieve this for me? Or will I need to create my own rules to transform these different URLs into an absolute URL?

Solution

The vanilla URL class might get you most of the way there, assuming you can work out which context to use. Here are some examples:

package grimbo.url;

import java.net.MalformedURLException;
import java.net.URL;

public class TestURL {
    public static void main(String[] args) {
        // context1: the base document sits at the site root
        URL c1 = u(null, "http://www.example.com/page.html");
        u(c1, "http://www.example.com/page.html");
        u(c1, "/page.html");
        u(c1, "page.html");
        u(c1, "../page.html");
        u(c1, "#paragraphA");

        System.out.println();

        // context2: the base document sits two directories deep
        URL c2 = u(null, "http://www.example.com/path/to/page.html");
        u(c2, "http://www.example.com/page.html");
        u(c2, "/page.html");
        u(c2, "page.html");
        u(c2, "../page.html");
        u(c2, "#paragraphA");
    }

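    // Resolves url against context (or parses it as an absolute URL when context is null),
    // prints the result, and returns it; returns null if the URL is malformed.
    public static URL u(URL context, String url) {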
        try {
            URL u = null != context ? new URL(context, url) : new URL(url);
            System.out.println(u);
            return u;
        } catch (MalformedURLException e) {
            e.printStackTrace();
            return null;
        }
    }
}

Results in:

http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/../page.html
http://www.example.com/page.html#paragraphA

http://www.example.com/path/to/page.html
http://www.example.com/page.html
http://www.example.com/page.html
http://www.example.com/path/to/page.html
http://www.example.com/path/page.html
http://www.example.com/path/to/page.html#paragraphA

As you can see, not every result is what you want: resolving ../page.html against the first context, for example, yields http://www.example.com/../page.html. So you could try parsing the value with new URL(value) first and, if that throws a MalformedURLException, resolve it relative to a context URL.
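A minimal sketch of a resolver that cleans up the ../ case, under a few assumptions: it swaps in java.net.URI.resolve for the URL constructor used above, and the names LinkResolver and resolve are hypothetical. Dropping ".." segments that would climb above the root mirrors what browsers do (RFC 3986, section 5.2.4); it is not something java.net is guaranteed to do on its own.

```java
import java.net.URI;

class LinkResolver {
    /**
     * Resolves an href taken from an anchor tag against the base URI and
     * drops any ".." segments left over at the start of the path.
     * Throws IllegalArgumentException if href is not a syntactically valid URI
     * (for example, if it contains unescaped spaces).
     */
    static URI resolve(URI base, String href) {
        URI resolved = base.resolve(href.trim()).normalize();
        String path = resolved.getRawPath();

        // Opaque URIs (mailto:...) and paths without a leading ".." need no fixing.
        if (resolved.getRawAuthority() == null || path == null || !path.startsWith("/..")) {
            return resolved;
        }

        // Strip "/.." segments that try to climb above the root.
        while (path.startsWith("/../")) {
            path = path.substring(3);
        }
        if (path.equals("/..")) {
            path = "/";
        }

        // Rebuild from the raw components so existing percent-escapes are kept as-is.
        StringBuilder sb = new StringBuilder();
        sb.append(resolved.getScheme()).append("://")
          .append(resolved.getRawAuthority()).append(path);
        if (resolved.getRawQuery() != null) {
            sb.append('?').append(resolved.getRawQuery());
        }
        if (resolved.getRawFragment() != null) {
            sb.append('#').append(resolved.getRawFragment());
        }
        return URI.create(sb.toString());
    }
}
```

With this, LinkResolver.resolve(URI.create("http://www.example.com/page.html"), "../page.html") gives http://www.example.com/page.html, while absolute hrefs pass through unchanged.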

OTHER TIPS

Your best bet is most likely to follow the same resolution process that browsers do, as outlined in the HTML spec:

User agents must calculate the base URI according to the following precedences (highest priority to lowest):

1. The base URI is set by the BASE element.

2. The base URI is given by meta data discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).

3. By default, the base URI is that of the current document. Not all HTML documents have a base URI (e.g., a valid HTML document may appear in an email and may not be designated by a URI). Such HTML documents are considered erroneous if they contain relative URIs and rely on a default base URI.

In practice, you're probably most concerned with the first and last of these, numbers 1 and 3 (i.e. check for a <base href="..."> and use either that, if it exists, or the URI of the current document).
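That check can be sketched as follows, assuming the page has already been parsed into an org.w3c.dom.Document (which is what JTidy's Tidy.parseDOM returns) and that element names come out lowercase. BaseUriFinder and baseUri are hypothetical names, and the HTTP-header case from the spec is deliberately left out.

```java
import java.net.URI;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class BaseUriFinder {
    /**
     * Returns the URI that relative links in doc should be resolved against:
     * the href of the first <base> element that has one, otherwise the URI
     * the document was fetched from.
     */
    static URI baseUri(Document doc, URI documentUri) {
        NodeList bases = doc.getElementsByTagName("base");
        for (int i = 0; i < bases.getLength(); i++) {
            String href = ((Element) bases.item(i)).getAttribute("href");
            if (href != null && !href.trim().isEmpty()) {
                // A <base href> may itself be relative, so resolve it against
                // the document URI. Throws IllegalArgumentException if href is
                // not a syntactically valid URI.
                return documentUri.resolve(href.trim());
            }
        }
        // No usable <base> element: fall back to the document's own URI.
        return documentUri;
    }
}
```

The returned URI is then the context to resolve each extracted href against.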

Licensed under: CC-BY-SA with attribution
