Getting substring from a given string in Java

Question 1

Try this; it may help:

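import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first(); // first <a> element in the document

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
// attr() already returns the bare value, so there is no quote left to find here
int nextIndex = linkHref.indexOf("\""); // -1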

Question 2

You can try this; it works.

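import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class AttributeParsing {

    public static void main(String[] args) {
        final String html = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Sports</font></a>";

        // Parse the fragment with the XML parser so no <html>/<body> wrapper is added
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        // First <a> element that has an href attribute
        Element th = doc.select("a[href]").first();

        String href = th.attr("href");

        System.out.println("th : " + th);
        System.out.println("href : " + href);
    }

}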

Output :

th : <a href="/sports/sports.asp" style="TEXT-DECORATION: NONE"><font color="#0000FF">Sports</font></a>

href : /sports/sports.asp

Question 3

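This is a basic bit of parsing using

String.indexOf

as in

int index = jsoupOutput.indexOf("href=\"");

and

int nextIndex = jsoupOutput.indexOf("\"", index + 6); // start after the opening quote of href="

with the necessary checks in place (indexOf returns -1 when nothing is found). The value is then jsoupOutput.substring(index + 6, nextIndex).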
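The "necessary checks" mentioned above could look roughly like the sketch below. The class name HrefIndexOf and the method firstHref are made up for illustration, and the code assumes the attribute is written exactly as href="..." with double quotes.

```java
import java.util.Optional;

// Hypothetical helper, not part of jsoup or the original answer.
final class HrefIndexOf {
    private static final String PREFIX = "href=\"";

    // Returns the value of the first double-quoted href attribute, if there is one.
    static Optional<String> firstHref(String html) {
        if (html == null) {
            return Optional.empty();
        }
        int start = html.indexOf(PREFIX);
        if (start < 0) {
            return Optional.empty();          // no href=" in the input
        }
        start += PREFIX.length();             // move past href="
        int end = html.indexOf('"', start);   // closing quote of the value
        if (end < 0) {
            return Optional.empty();          // value is never terminated
        }
        return Optional.of(html.substring(start, end));
    }

    public static void main(String[] args) {
        String jsoupOutput = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\">Sports</a>";
        System.out.println(firstHref(jsoupOutput).orElse("(no href found)"));
    }
}
```

Returning an Optional instead of throwing keeps the "not found" cases explicit for the caller; that is a design choice of this sketch, not something the answer prescribes.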

Question 4

Let's assume that the String anchor contains one of these links. The begin index of the substring is then the position just after href=" (index 9 here), and the end index is the first quotation mark after that position:

String anchor = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Sports</font></a>";
int beginIndex = anchor.indexOf("href=\"") + 6; //To start after <a href="
int endIndex = anchor.indexOf("\"", beginIndex);
String desiredPart = anchor.substring(beginIndex, endIndex);

That's all it takes, as long as the anchor always has this shape. A regular expression is a better option, and an XML parser would be the best.
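For the XML parser route, here is a minimal sketch using only the JDK's built-in DOM parser. The class name HrefXmlSketch and the method hrefOf are hypothetical, and the approach assumes the anchor is well-formed XML, which real-world HTML often is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes the input is well-formed XML.
final class HrefXmlSketch {

    // Parses the anchor as an XML document and reads href from its root element.
    // Returns an empty string when the root element has no href attribute.
    static String hrefOf(String anchor) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(anchor)));
            return doc.getDocumentElement().getAttribute("href");
        } catch (Exception e) {
            // Covers parser configuration, I/O and well-formedness errors alike
            throw new IllegalArgumentException("Could not parse anchor as XML", e);
        }
    }

    public static void main(String[] args) {
        String anchor = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Sports</font></a>";
        System.out.println(hrefOf(anchor));
    }
}
```

Unlike the indexOf version, this does not depend on attribute order or quoting style, but it rejects markup that is not well-formed.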

Question 5

Use this as a reference:

import java.util.regex.*;

public class HelloWorld{

     public static void main(String []args){

         String s = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Sports</font></a>"+
                    "<a href=\"/titanic/titanic.asp\" style=\"TEXT-DECORATION: NONE\"><font color=\"#0000FF\">Titanic</font></a>"+
                    "<a href=\"gastheft.asp\" onmouseover=\"window.status='License Plate Theft';return true\" onmouseout=\"window.status='';return true\">license plates</a>"+
                    "<a href=\"miracle.asp\" onmouseover=\"window.status='Miracle Cars';return true\" onmouseout=\"window.status='';return true\">miracle cars</a>"+
                    "<a href=\"/crime/warnings/clear.asp\" onmouseover=\"window.status='Clear Loss';return true\" onmouseout=\"window.status='';return true\" target=\"clear\">Clear</a>";
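       // Capture the attribute value lazily, up to the closing quote
       Pattern p = Pattern.compile("href=\"(.+?)\"");
       Matcher m = p.matcher(s);
       while(m.find())
       {
           // group(1) is the value itself, so a '=' inside the URL cannot break the result
           System.out.println(m.group(1));
       }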

     }
}

Output

/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp

Question 6

You can do it in one line:

String[] paths = str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");

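This assumes str holds one link per line and that the target is the first double-quoted value on each line. The first method call removes everything except the target from each line, and the second splits on newlines (it will work with all OS line terminators).

FYI, (?m) turns on "multiline mode", so ^ and $ match at line boundaries, and (?ms) also turns on the "dotall" flag, so . matches line terminators too.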
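To see the one-liner in action, here is a small sketch. The class name OneLinePaths and the method extractPaths are invented for the example, and the input is assumed to contain one anchor per line with no trailing line terminator.

```java
import java.util.Arrays;

// Hypothetical wrapper around the one-liner above, for illustration only.
final class OneLinePaths {

    // Step 1: reduce every line to its first double-quoted value.
    // Step 2: split the result on line terminators.
    static String[] extractPaths(String str) {
        return str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
    }

    public static void main(String[] args) {
        String str = "<a href=\"/sports/sports.asp\" style=\"TEXT-DECORATION: NONE\">Sports</a>\n"
                   + "<a href=\"gastheft.asp\">license plates</a>";
        System.out.println(Arrays.toString(extractPaths(str)));
    }
}
```

Because the pattern grabs the first quoted value on each line, it only returns the href when href is the first attribute of the anchor.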