How to search in a HTML file for some tags?

https://stackoverflow.com/questions/672791

21-08-2019
Question

I'm having a little problem in Java. I want to search an HTML file for the href and src attributes, and then get the URL associated with each of them.

What is the best way to do it?

Thanks for the help. Best regards.

Solution

This is the code I used to accomplish exactly what you'd like to do, but first let me give you a few tips.

If you're in a Java Swing environment, make sure to use the methods in the javax.swing.text.html and javax.swing.text.html.parser packages. Unfortunately, they're mostly intended for use on a JEditorPane, but I'd still strongly recommend that you take a look at these.

There's a class in the Java 6 API called HTML.Tag that identifies the HTML start and end tags. You can use it to determine where the links you'd like your program to follow are: http://java.sun.com/javase/6/docs/api/javax/swing/text/html/HTML.Tag.html

When I wrote a program very similar to this, I used three main methods of HTMLEditorKit.ParserCallback:

public void handleStartTag(HTML.Tag t, MutableAttributeSet atts, int pos)
public void handleEndTag(HTML.Tag t, int pos)
public void handleText(char[] text, int pos)

If you need more help writing these methods, you can message me. Basically, you look for a start tag and its matching end tag; the start tag's attributes give you the URL. You can then proceed to the next step, which is following the URL.
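The callback approach above can be sketched as follows. This is only an illustration, not the answerer's actual program: the class name LinkExtractor and the helper extract are made up for the example, and handleSimpleTag is overridden in addition to the three methods listed because the Swing parser reports empty elements such as img through it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: collects every href and src value it sees.
public class LinkExtractor extends HTMLEditorKit.ParserCallback {
    private final List<String> urls = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet atts, int pos) {
        collect(atts); // tags with content, e.g. <a href="...">
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet atts, int pos) {
        collect(atts); // empty tags, e.g. <img src="...">
    }

    // Record the href and src attribute values, if present.
    private void collect(MutableAttributeSet atts) {
        HTML.Attribute[] wanted = {HTML.Attribute.HREF, HTML.Attribute.SRC};
        for (HTML.Attribute name : wanted) {
            Object value = atts.getAttribute(name);
            if (value != null) {
                urls.add(value.toString());
            }
        }
    }

    // Parse the HTML from the reader and return the URLs in the order found.
    static List<String> extract(Reader html) throws IOException {
        LinkExtractor callback = new LinkExtractor();
        new ParserDelegator().parse(html, callback, true);
        return callback.urls;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"https://example.com/page\">link</a>"
                + "<img src=\"logo.png\"></body></html>";
        System.out.println(extract(new StringReader(html)));
    }
}
```

To read from a file instead of a string, a java.io.FileReader could be passed to extract in place of the StringReader.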

To follow the URL, I advise you to use a JEditorPane object. The javax.swing.event.HyperlinkListener interface defines only one method, hyperlinkUpdate(HyperlinkEvent e). Inside it, you can call .setPage(e.getURL()) on your JEditorPane object. This updates the pane with the new page and lets you start the process again.
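A minimal sketch of such a listener might look like this. The class name LinkFollower and the helper createBrowserPane are hypothetical; the sketch also assumes the pane is set to non-editable, since JEditorPane only generates hyperlink events in that state.

```java
import java.io.IOException;
import java.net.URL;
import javax.swing.JEditorPane;
import javax.swing.event.HyperlinkEvent;
import javax.swing.event.HyperlinkListener;

// Hypothetical example class: loads a clicked link into the same pane.
public class LinkFollower implements HyperlinkListener {
    private final JEditorPane pane;

    LinkFollower(JEditorPane pane) {
        this.pane = pane;
    }

    @Override
    public void hyperlinkUpdate(HyperlinkEvent e) {
        // Only react to clicks, not to the mouse entering or leaving a link.
        if (e.getEventType() != HyperlinkEvent.EventType.ACTIVATED) {
            return;
        }
        URL url = e.getURL();
        if (url == null) {
            return; // the link text could not be turned into a URL
        }
        try {
            pane.setPage(url);
        } catch (IOException ex) {
            System.err.println("Could not load " + url + ": " + ex.getMessage());
        }
    }

    // Create a read-only pane that follows the links the user clicks.
    static JEditorPane createBrowserPane() {
        JEditorPane pane = new JEditorPane();
        pane.setEditable(false);
        pane.addHyperlinkListener(new LinkFollower(pane));
        return pane;
    }
}
```

The pane returned by createBrowserPane would still need to be placed in a window (for example inside a JScrollPane in a JFrame) and given a first page with setPage.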

Message me if you have any problems, and please vote for this answer!

OTHER TIPS

Do you want to do this as a one-time editing task, or do you need a systematic (i.e. code) implementation? In the second case, find a Java HTML parser implementation and walk the DOM tree.

http://java-source.net/open-source/html-parsers

Take a look at this question:

The answer I used was JTidy

You can use Rhino, then load the HTML file. Once it is loaded, you can use getElementBy to go to any node or to get a value.

If your file is an XHTML document, it is a standard XML document, and the best way to parse it is with JDOM. JDOM is very powerful and easy to use and understand.

If you have an HTML document, you can try htmlparser, in particular the class LinkTag.

I would have a look at tagsoup, which will build a DOM tree from any HTML document, even the most non-compliant ones.

Then use XPath and iterate over the NodeList returned by:

//a

and

//img
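The XPath step could look roughly like the sketch below. To stay within the JDK, it assumes the input is already well-formed XHTML and parses it with the standard DocumentBuilder rather than TagSoup; for real-world HTML, the Document would come from a tolerant parser instead. The class name XPathLinkFinder and its methods are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: finds link and image URLs in a DOM tree.
public class XPathLinkFinder {

    // Assumption: the markup is well-formed XML (XHTML without a DOCTYPE).
    static Document parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Collect href values from //a, then src values from //img.
    static List<String> findUrls(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> urls = new ArrayList<>();
        addAttributeValues(xpath, doc, "//a", "href", urls);
        addAttributeValues(xpath, doc, "//img", "src", urls);
        return urls;
    }

    private static void addAttributeValues(XPath xpath, Document doc, String expression,
                                           String attribute, List<String> out) throws Exception {
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            String value = ((Element) nodes.item(i)).getAttribute(attribute);
            if (!value.isEmpty()) { // getAttribute returns "" when the attribute is absent
                out.add(value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseXhtml(
                "<html><body><a href=\"page.html\">x</a><img src=\"pic.png\"/></body></html>");
        System.out.println(findUrls(doc));
    }
}
```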

I've used the Neko HTML Parser successfully for this sort of thing (screen scraping).

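import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;

public class TestParser {

     public static void main(String[] argv) throws Exception {
          DOMParser parser = new DOMParser();
          // The original listing is cut off after "i"; the rest of this loop is an assumed completion.
          for (int i = 0; i < argv.length; i++) {
               parser.parse(argv[i]);
               Node root = parser.getDocument().getDocumentElement();
               System.out.println(root.getNodeName());
          }
     }
}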

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow