Question

I am reading the content of a web page and parsing it with the Jsoup parser to get only the hyperlinks that exist in the body section. The output looks like this:

<a href="/sports/sports.asp" style="TEXT-DECORATION: NONE"><font color="#0000FF">Sports</font></a>
<a href="/titanic/titanic.asp" style="TEXT-DECORATION: NONE"><font color="#0000FF">Titanic</font></a>
<a href="gastheft.asp" onmouseover="window.status='License Plate Theft';return true" onmouseout="window.status='';return true">license plates</a>
<a href="miracle.asp" onmouseover="window.status='Miracle Cars';return true" onmouseout="window.status='';return true">miracle cars</a>
<a href="/crime/warnings/clear.asp" onmouseover="window.status='Clear Loss';return true" onmouseout="window.status='';return true" target="clear">Clear</a>

and even more hyperlinks.

From all of these, I am only interested in values such as:

/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp

How can I do this using String operations, or is there some other way to extract this information using the Jsoup parser itself?


Solution 2

Try this; it may help:

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
// attr("href") already returns the bare attribute value, so no indexOf-based trimming is needed here

OTHER TIPS

You can try this; it works.

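import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class AttributeParsing {

public static void main(String[] args) {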
    final String html = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Sports</font></a>";

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    Element th = doc.select("a[href]").first();

    String href = th.attr("href");

    // Label each value so the printed lines match the output shown below
    System.out.println("th : " + th);
    System.out.println("href : " + href);
}

}

Output :

th : <a href="/sports/sports.asp" style="TEXT-DECORATION: NONE"><font color="#0000FF">Sports</font></a>

href : /sports/sports.asp

This should be a basic bit of parsing using

String.indexOf 

as in

int index = jsoupOutput.indexOf("href=\"");

and

int nextIndex = jsoupOutput.indexOf("\"", index + 6); // skip the 6 characters of href=" so the closing quote is found, not the opening one

with the necessary checks in place.
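
The "necessary checks" mentioned above might look like the following minimal sketch. The class name HrefIndexOfSketch and the method extractHref are hypothetical names chosen for illustration, and the sketch assumes the attribute is always written as href=" with double quotes. It returns null when no such attribute is found or when the quote is never closed.

```java
final class HrefIndexOfSketch {

    // Assumption: the attribute is always spelled exactly like this, with double quotes.
    private static final String MARKER = "href=\"";

    // Returns the value of the first href="..." attribute, or null if none is found.
    static String extractHref(String jsoupOutput) {
        int index = jsoupOutput.indexOf(MARKER);
        if (index < 0) {
            return null; // no href=" in the input
        }
        int valueStart = index + MARKER.length();
        int nextIndex = jsoupOutput.indexOf("\"", valueStart);
        if (nextIndex < 0) {
            return null; // the opening quote was never closed
        }
        return jsoupOutput.substring(valueStart, nextIndex);
    }

    public static void main(String[] args) {
        String anchor = "<a href=\"gastheft.asp\" onmouseover=\"window.status='License Plate Theft';return true\">license plates</a>";
        System.out.println(extractHref(anchor));                        // gastheft.asp
        System.out.println(extractHref("<a name=\"top\">no link</a>")); // null
    }
}
```

Returning null keeps the sketch short; an Optional<String> would work just as well if the caller prefers it.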

Let's assume that the String anchor contains one of these links. The substring then begins right after href=" (index 9 in this example) and ends at the first quotation mark after that position:

String anchor = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Sports</font></a>";
int beginIndex = anchor.indexOf("href=\"") + 6; //To start after <a href="
int endIndex = anchor.indexOf("\"", beginIndex);
String desiredPart = anchor.substring(beginIndex, endIndex);

And that's it, as long as the anchor always has that shape. A better option is to use regular expressions, and the best would be to use an XML parser.

Use this as a reference:

import java.util.regex.*;

public class HelloWorld{

     public static void main(String []args){

         String s = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Sports</font></a>"+
                    "<a href=\"/titanic/titanic.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Titanic</font></a>"+
                    "<a href=\"gastheft.asp\" onmouseover=\"window.status='License Plate Theft';return true\" onmouseout=\"window.status='';return true\">license plates</a>"+
                    "<a href=\"miracle.asp\" onmouseover=\"window.status='Miracle Cars';return true\" onmouseout=\"window.status='';return true\">miracle cars</a>"+
                    "<a href=\"/crime/warnings/clear.asp\" onmouseover=\"window.status='Clear Loss';return true\" onmouseout=\"window.status='';return true\" target=\"clear\">Clear</a>";
       // Capture the attribute value in group 1 so that values containing '=' (e.g. query strings) stay intact
       Pattern p = Pattern.compile("href=\"(.+?)\"");
       Matcher m = p.matcher(s);
       while (m.find())
       {
           System.out.println(m.group(1));
       }

     }
}

Output

/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp

You can do it in one line:

String[] paths = str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");

The first method call replaces each line with the first double-quoted value on it (which is the href as long as href is the first attribute), and the second splits on newlines (this will work with all OS line terminators).

FYI (?m) turns on "multiline mode" and (?ms) also turns on the "dotall" flag.
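
As a rough illustration of how the one-liner behaves, here is a small sketch. The class name OneLinerSketch, the method extractPaths, and the sample input are made up for this example, and it assumes that str holds one anchor per line with href as the first quoted attribute.

```java
import java.util.Arrays;

final class OneLinerSketch {

    // Same expression as above, wrapped in a hypothetical helper method.
    static String[] extractPaths(String str) {
        return str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
    }

    public static void main(String[] args) {
        // Assumed input shape: one anchor per line, separated by \n.
        String str = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\">Sports</a>\n"
                   + "<a href=\"gastheft.asp\">license plates</a>\n"
                   + "<a href=\"/crime/warnings/clear.asp\" target=\"clear\">Clear</a>";

        // prints [/sports/sports.asp, gastheft.asp, /crime/warnings/clear.asp]
        System.out.println(Arrays.toString(extractPaths(str)));
    }
}
```

If several anchors share a single line, only the first quoted value on that line survives, so this approach suits line-oriented output only.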

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow