Question

I have a string which contains digits and letters. I wish to split the string into contiguous chunks of digits and contiguous chunks of letters.

Consider the String "34A312O5M444123A".

I would like to output: ["34", "A", "312", "O", "5", "M", "444123", "A"]

I have code which works and looks like:

List<String> digitsAsElements(String str){
  StringBuilder digitCollector = new StringBuilder();

  List<String> output = new ArrayList<String>();

  for (int i = 0; i < str.length(); i++){
    char cChar = str.charAt(i);

    if (Character.isDigit(cChar))
       digitCollector.append(cChar);
    else{
      output.add(digitCollector.toString());
      output.add(""+cChar);

      digitCollector = new StringBuilder();
    }         
  }

  return output;
}

I considered splitting str twice to get an array containing all the numbers chunks and an array containing the all letters chunks. Then merging the results. I shied away from this as it would harm readability.

I have intentionally avoided solving this with a regex pattern as I find regex patterns to be a major impediment to readability.

  • Debuggers don't handle them well.
  • They interrupt the flow of someone reading source code.
  • Overtime regex's grow organically and become monsters.
  • They are deeply non intuitive.

My questions are:

  • How could I improve the readability of the above code?
  • Is there a better way to do this? A Util class that solves this problem elegantly.
  • Where do you draw the line between using a regEx and coding something simpilar to what I've written above?
  • How do you increase the readability/maintainability of regExes?
Was it helpful?

Solution

Would you be willing to use regexes if it meant solving the problem in one line of code?

// Split at any position that's either:
// preceded by a digit and followed by a non-digit, or
// preceded by a non-digit and followed by a digit.
String[] parts = str.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");

With the comment to explain the regex, I think that's more readable than any of the non-regex solutions (or any of the other regex solutions, for that matter).

OTHER TIPS

For this particular task I'd always use a regex instead of hand-writing something similar. The code you have given above is, at least to me, less readable than a simple regular expression (which would be (\d+|[^\d]+) in this case, as far as I can see).

You may want to avoid writing regular expressions that exceed a few lines. Those can be and usually are unreadable and hard to understand, but so is the code they can be replaced with! Parsers are almost never pretty and you're usually better off reading the original grammar than trying to make sense of the generated (or handwritten) parser. Same goes (imho) for regexes which are just a concise description of a regular grammar.

So, in general I'd say banning regexes in favor of code like you've given in your question sounds like a terribly stupid idea. And regular expressions are just a tool, nothing less, nothing more. If something else does a better job of text parsing (say, a real parser, some substring magic, etc.) then use it. But don't throw away possibilities just because you feel uncomfortable with them – others may have less problems coping with them and all people are able to learn.

EDIT: Updated regex after comment by mmyers.

For a utility class, check out java.util.Scanner. There are a number of options in there as to how you might go about solving your problem. I have a few comments on your questions.

Debuggers don't handle them (regular expressions) well

Whether a regex works or not depends on whats in your data. There are some nice plugins you can use to help you build a regex, like QuickREx for Eclipse, does a debugger actually help you write the right parser for your data?

They interrupt the flow of someone reading source code.

I guess it depends on how comfortable you are with them. Personally, I'd rather read a reasonable regex than 50 more lines of string parsing code, but maybe that's a personal thing.

Overtime regex's grow organically and become monsters.

I guess they might, but that's probably a problem with the code they live in becoming unfocussed. If the complexity of the source data is increasing, you probably need to keep an eye on whether you need a more expressive solution (maybe a parser generator like ANTLR)

They are deeply non intuitive.

They're a pattern matching language. I would say they're pretty intuitive in that context.

How could I improve the readability of the above code?

Not sure, apart from use a regex.

Is there a better way to do this? A Util class that solves this problem elegantly.

Mentioned above, java.util.Scanner.

Where do you draw the line between using a regEx and coding something simpilar to what I've written above?

Personally I use regex for anything reasonably simple.

How do you increase the readability/maintainability of regExes?

Think carefully before extending,take extra care to comment up the code and the regex in detail so that it's clear what you're doing.

I would use something like this (warning, untested code). For me this is a lot more readable than trying to avoid regexps. Regexps are a great tool when used in right place.

Commenting methods and providing examples of input and output values in comments also helps.

List<String> digitsAsElements(String str){
    Pattern p = Pattern.compile("(\\d+|\\w+)*");
    Matcher m = p.matcher(str);

    List<String> output = new ArrayList<String>();
    for(int i = 1; i <= m.groupCount(); i++) {
       output.add(m.group(i));
    }
    return output;
}

I'm not overly crazy about regex myself, but this seems like a case where they will really simplify things. What you might want to do is put them into the smallest method you can devise, name it aptly, and then put all the control code in another method.

For instance, if you coded a "Grab block of numbers or letters" method, the caller would be a very simple, straight-forward loop just printing the results of each call, and the method you were calling would be well-defined so the intention of the regex would be clear even if you didn't know anything about the syntax, and the method would be bounded so people wouldn't be likely to muck it up over time.

The problem with this is that the regex tools are so simple and well-adapted to this use that it's hard to justify a method call for this.

Since no one seems to have posted correct code yet, I'll give it a shot.

First the non-regex version. Note that I use the StringBuilder for accumulating whichever type of character was seen last (digit or non-digit). If the state changes, I dump its contents into the list and start a new StringBuilder. This way consecutive non-digits are grouped just like consecutive digits are.

static List<String> digitsAsElements(String str) {
    StringBuilder collector = new StringBuilder();

    List<String> output = new ArrayList<String>();
    boolean lastWasDigit = false;
    for (int i = 0; i < str.length(); i++) {
        char cChar = str.charAt(i);

        boolean isDigit = Character.isDigit(cChar);
        if (isDigit != lastWasDigit) {
            if (collector.length() > 0) {
                output.add(collector.toString());
                collector = new StringBuilder();
            }
            lastWasDigit = isDigit;
        }
        collector.append(cChar);
    }
    if (collector.length() > 0)
        output.add(collector.toString());

    return output;
}

Now the regex version. This is basically the same code that was posted by Juha S., but the regex actually works.

private static final Pattern DIGIT_OR_NONDIGIT_STRING =
        Pattern.compile("(\\d+|[^\\d]+)");
static List<String> digitsAsElementsR(String str) {
    // Match a consecutive series of digits or non-digits
    final Matcher matcher = DIGIT_OR_NONDIGIT_STRING.matcher(str);
    final List<String> output = new ArrayList<String>();
    while (matcher.find()) {
        output.add(matcher.group());
    }
    return output;
}

One way I try to keep my regexes readable is their names. I think DIGIT_OR_NONDIGIT_STRING conveys pretty well what I (the programmer) think it does, and testing should make sure that it really does what it's meant to do.

public static void main(String[] args) {
    System.out.println(digitsAsElements( "34A312O5MNI444123A"));
    System.out.println(digitsAsElementsR("34A312O5MNI444123A"));
}

prints:

[34, A, 312, O, 5, MNI, 444123, A]
[34, A, 312, O, 5, MNI, 444123, A]

Awww, someone beat me to code. I think the regex version is easier to read/maintain. Also, note the difference in output between the 2 implementations vs the expected output ...

Output:

digitsAsElements1("34A312O5MNI444123A") = [34, A, 312, O, 5, M, , N, , I, 444123, A]
digitsAsElements2("34A312O5MNI444123A") = [34, A, 312, O, 5, MNI, 444123, A]
Expected: [34, A, 312, O, 5, MN, 444123, A]

Compare:

DigitsAsElements.java:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitsAsElements {

    static List<String> digitsAsElements1(String str){
        StringBuilder digitCollector = new StringBuilder();

        List<String> output = new ArrayList<String>();

        for (int i = 0; i < str.length(); i++){
          char cChar = str.charAt(i);

          if (Character.isDigit(cChar))
             digitCollector.append(cChar);
          else{
            output.add(digitCollector.toString());
            output.add(""+cChar);

            digitCollector = new StringBuilder();
          }         
        }

        return output;
      }

    static List<String> digitsAsElements2(String str){
        // Match a consecutive series of digits or non-digits
        final Pattern pattern = Pattern.compile("(\\d+|\\D+)");
        final Matcher matcher = pattern.matcher(str);

        final List<String> output = new ArrayList<String>();
        while (matcher.find()) {
            output.add(matcher.group());
        }

        return output;
      }

    /**
     * @param args
     */
    public static void main(String[] args) {
        System.out.println("digitsAsElements(\"34A312O5MNI444123A\") = " +
                digitsAsElements1("34A312O5MNI444123A"));
        System.out.println("digitsAsElements2(\"34A312O5MNI444123A\") = " +
                digitsAsElements2("34A312O5MNI444123A"));
        System.out.println("Expected: [" +
                "34, A, 312, O, 5, MN, 444123, A"+"]");
    }

}

you could use this class in order to simplify your loop:

public class StringIterator implements Iterator<Character> {

    private final char[] chars;
    private int i;

    private StringIterator(char[] chars) {
        this.chars = chars;
    }

    public boolean hasNext() {
        return i < chars.length;
    }

    public Character next() {
        return chars[i++];
    }

    public void remove() {
        throw new UnsupportedOperationException("Not supported.");
    }

    public static Iterable<Character> of(String string) {
        final char[] chars = string.toCharArray();

        return new Iterable<Character>() {

            @Override
            public Iterator<Character> iterator() {
                return new StringIterator(chars);
            }
        };
    }
}

Now you can rewrite this:

for (int i = 0; i < str.length(); i++){
    char cChar = str.charAt(i);
    ...
}

with:

for (Character cChar : StringIterator.of(str)) {
    ...
}

my 2 cents

BTW this class is also reusable in other context.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top