Java regex to capture html tag and its attributes

https://stackoverflow.com/questions/19576681

01-07-2022

Question

(I am aware that regex is not the recommended way to deal with html, but this is my assignment)

I need a regex in Java that will capture html tags and their attributes. I am trying to achieve this with one regex using groups. I expected this regex to work:

<(?!!)(?!/)\s*(\w+)(?:\s*(\S+)=['"]{1}[^>]*?['"]{1})*\s*>
<                                                            the tag starts with <
 (?!!)                                                       I don't want comments
      (?!/)                                                  I don't want closing tags
           \s*                                               any number of white spaces 
              (\w+)                                          the tag
                   (?:                                       do not capture the following group
                      \s*                                    any number of white spaces before the first attribute
                         (\S+)                               capture the attributes name
                              =['"]{1}[^>]*?['"]{1}          the ="bottm" or ='bottm' etc.
                                                   )*        close the non-capturing group, it can occur multiple times or zero times
                                                     \s*     any white spaces before the closing of the tag
                                                        >    close the tag

I expected the result for a tag like:

<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class"

but the result is:

group(1) = "div"
group(2) = "class"

It seems that it is not possible to capture a group multiple times with (...)*. Is this correct?

For now I use a regex like:

<(?!!)(?!/)\s*(\w+) (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (...){0,1} (...){0,1} ... \s*>

I repeat the capturing group for the attribute multiple times and get results like:

<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class" 
group(4) = null 
group(5) = null 
group(6) = null 
...

What other approaches can I use? (I can use multiple regexes, but it is preferred to do it with just one)

Solution

It seems that it is impossible to make one capturing group capture multiple times. The number of groups is fixed by the pattern, so the result of using

(..regex for group...)*

will still be just one matched group, holding the text of its last repetition.
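The behaviour described above can be checked with a small sketch. The pattern here is a simplified version of the one in the question, and the class and method names (RepeatedGroupDemo, lastCapturedAttribute) are made up for illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Simplified form of the question's pattern (an assumption, not the exact original):
    // group 1 is the tag name, group 2 sits inside a repeated non-capturing group.
    static final Pattern TAG = Pattern.compile(
            "<(\\w+)(?:\\s+(\\w+)=['\"][^'\">]*['\"])*\\s*>");

    /** Returns what group(2) holds after the whole tag matched, or null if nothing was captured. */
    static String lastCapturedAttribute(String tag) {
        Matcher m = TAG.matcher(tag);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        Matcher m = TAG.matcher("<div id=\"qwerty\" class='someClass' >");
        if (m.matches()) {
            // groupCount() depends only on the pattern, never on the input
            System.out.println("groupCount = " + m.groupCount());
            System.out.println("group(1) = " + m.group(1));
            // each repetition overwrites the previous capture, so only the last one is left
            System.out.println("group(2) = " + m.group(2));
        }
    }
}
```

For the sample tag this should report two groups, with group(2) holding "class"; the earlier capture of "id" is overwritten and cannot be recovered from the Matcher afterwards.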

The code below captures the whole tag in a first step and then captures all of its attributes in a second step:

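import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagScanner {
    public static void main(String[] args) throws IOException {
        // Download the page into a single string.
        URL url = new URL("http://stackoverflow.com/");
        URLConnection connection = url.openConnection();
        StringBuilder stringBuilder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
            String inputLine;
            while ((inputLine = reader.readLine()) != null) {
                // readLine() drops the line break; add a space so text on adjacent lines does not merge
                stringBuilder.append(inputLine).append(' ');
            }
        }
        String pageContent = stringBuilder.toString();

        // Step 1: match each opening tag; group(1) is the tag name, group(2) is the rest of the tag.
        Pattern pattern = Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
        // Step 2: match every name="value" or name='value' pair inside the rest of the tag.
        Pattern attributePattern = Pattern.compile("(\\S+)=['\"]{1}([^>]*?)['\"]{1}");

        Matcher matcher = pattern.matcher(pageContent);
        while (matcher.find()) {
            String tagName = matcher.group(1);
            String attributes = matcher.group(2);
            System.out.println("tag name: " + tagName);
            System.out.println("     rest of the tag: " + attributes);
            Matcher attributeMatcher = attributePattern.matcher(attributes);
            while (attributeMatcher.find()) {
                String attributeName = attributeMatcher.group(1);
                String attributeValue = attributeMatcher.group(2);
                System.out.println("         attribute name: " + attributeName + "    value: " + attributeValue);
            }
        }
    }
}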
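To try the two-step approach without a network connection, the same two patterns can be applied to an in-memory string. This is only a sketch: the class and method names (TagAttributeExtractor, firstTagAttributes) are hypothetical, and it looks at the first opening tag only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAttributeExtractor {
    // The same two patterns as in the answer, compiled once and reused.
    private static final Pattern TAG =
            Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
    private static final Pattern ATTRIBUTE =
            Pattern.compile("(\\S+)=['\"]{1}([^>]*?)['\"]{1}");

    /** Returns the attributes of the first opening tag in html, in source order. */
    static Map<String, String> firstTagAttributes(String html) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(html);
        if (tag.find()) {
            // Second pass: scan only the text between the tag name and the closing '>'
            Matcher attribute = ATTRIBUTE.matcher(tag.group(2));
            while (attribute.find()) {
                attributes.put(attribute.group(1), attribute.group(2));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(firstTagAttributes("<div id=\"qwerty\" class='someClass' >"));
    }
}
```

With the tag from the question, the returned map contains id mapped to qwerty and class mapped to someClass, in that order. Like the original patterns, this sketch does not handle unquoted values or a '>' inside an attribute value.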

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow