I need to find all the urls present in an html file which is stored in my computer itself and extract the links and store it to a variable. I'm using the code below to scan the file and get the lines. But I'm having a hard time extracting just the links. I would appreciate if someone could help me out.

    Scanner htmlScanner = new Scanner(new File(args[0]));
    PrintWriter output = new PrintWriter(new FileWriter(args[1]));
    while(htmlScanner.hasNext()){
        output.print(htmlScanner.next());

    }
    System.out.println("\nDone");
    htmlScanner.close();
    output.close();
有帮助吗?

解决方案

You can actually do this with the Swing HTML parser. Though the Swing parser only understands HTML 3.2, tags introduced in later HTML versions will simply be treated as unknown, and all you actually want are links anyway.

static Collection<String> getLinks(Path file)
throws IOException,
       MimeTypeParseException,
       BadLocationException {

    HTMLEditorKit htmlKit = new HTMLEditorKit();

    HTMLDocument htmlDoc;
    try {
        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        try (Reader reader =
            Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {

            htmlKit.read(reader, htmlDoc, 0);
        }
    } catch (ChangedCharSetException e) {
        MimeType mimeType = new MimeType(e.getCharSetSpec());
        String charset = mimeType.getParameter("charset");

        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlDoc.putProperty("IgnoreCharsetDirective", true);
        try (Reader reader =
            Files.newBufferedReader(file, Charset.forName(charset))) {

            htmlKit.read(reader, htmlDoc, 0);
        }
    }

    Collection<String> links = new ArrayList<>();

    for (HTML.Tag tag : Arrays.asList(HTML.Tag.LINK, HTML.Tag.A)) {
        HTMLDocument.Iterator it = htmlDoc.getIterator(tag);
        while (it.isValid()) {
            String link = (String)
                it.getAttributes().getAttribute(HTML.Attribute.HREF);

            if (link != null) {
                links.add(link);
            }

            it.next();
        }
    }

    return links;
}
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top