Question

I need to find all the urls present in an html file which is stored in my computer itself and extract the links and store it to a variable. I'm using the code below to scan the file and get the lines. But I'm having a hard time extracting just the links. I would appreciate if someone could help me out.

    Scanner htmlScanner = new Scanner(new File(args[0]));
    PrintWriter output = new PrintWriter(new FileWriter(args[1]));
    while(htmlScanner.hasNext()){
        output.print(htmlScanner.next());

    }
    System.out.println("\nDone");
    htmlScanner.close();
    output.close();
Was it helpful?

Solution

You can actually do this with the Swing HTML parser. Though the Swing parser only understands HTML 3.2, tags introduced in later HTML versions will simply be treated as unknown, and all you actually want are links anyway.

static Collection<String> getLinks(Path file)
throws IOException,
       MimeTypeParseException,
       BadLocationException {

    HTMLEditorKit htmlKit = new HTMLEditorKit();

    HTMLDocument htmlDoc;
    try {
        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        try (Reader reader =
            Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {

            htmlKit.read(reader, htmlDoc, 0);
        }
    } catch (ChangedCharSetException e) {
        MimeType mimeType = new MimeType(e.getCharSetSpec());
        String charset = mimeType.getParameter("charset");

        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlDoc.putProperty("IgnoreCharsetDirective", true);
        try (Reader reader =
            Files.newBufferedReader(file, Charset.forName(charset))) {

            htmlKit.read(reader, htmlDoc, 0);
        }
    }

    Collection<String> links = new ArrayList<>();

    for (HTML.Tag tag : Arrays.asList(HTML.Tag.LINK, HTML.Tag.A)) {
        HTMLDocument.Iterator it = htmlDoc.getIterator(tag);
        while (it.isValid()) {
            String link = (String)
                it.getAttributes().getAttribute(HTML.Attribute.HREF);

            if (link != null) {
                links.add(link);
            }

            it.next();
        }
    }

    return links;
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top