Question

I need to write a Java Comparator class that compares Strings, however with one twist. If the two strings it is comparing are the same at the beginning and end of the string are the same, and the middle part that differs is an integer, then compare based on the numeric values of those integers. For example, I want the following strings to end up in order they're shown:

  • aaa
  • bbb 3 ccc
  • bbb 12 ccc
  • ccc 11
  • ddd
  • eee 3 ddd jpeg2000 eee
  • eee 12 ddd jpeg2000 eee

As you can see, there might be other integers in the string, so I can't just use regular expressions to break out any integer. I'm thinking of just walking the strings from the beginning until I find a bit that doesn't match, then walking in from the end until I find a bit that doesn't match, and then comparing the bit in the middle to the regular expression "[0-9]+", and if it compares, then doing a numeric comparison, otherwise doing a lexical comparison.

Is there a better way?

Update I don't think I can guarantee that the other numbers in the string, the ones that may match, don't have spaces around them, or that the ones that differ do have spaces.

Was it helpful?

Solution

The Alphanum Algorithm

From the website

"People sort strings with numbers differently than software. Most sorting algorithms compare ASCII values, which produces an ordering that is inconsistent with human logic. Here's how to fix it."

Edit: Here's a link to the Java Comparator Implementation from that site.

OTHER TIPS

Interesting little challenge, I enjoyed solving it.

Here is my take at the problem:

String[] strs =
{
  "eee 5 ddd jpeg2001 eee",
  "eee 123 ddd jpeg2000 eee",
  "ddd",
  "aaa 5 yy 6",
  "ccc 555",
  "bbb 3 ccc",
  "bbb 9 a",
  "",
  "eee 4 ddd jpeg2001 eee",
  "ccc 11",
  "bbb 12 ccc",
  "aaa 5 yy 22",
  "aaa",
  "eee 3 ddd jpeg2000 eee",
  "ccc 5",
};

Pattern splitter = Pattern.compile("(\\d+|\\D+)");

public class InternalNumberComparator implements Comparator
{
  public int compare(Object o1, Object o2)
  {
    // I deliberately use the Java 1.4 syntax, 
    // all this can be improved with 1.5's generics
    String s1 = (String)o1, s2 = (String)o2;
    // We split each string as runs of number/non-number strings
    ArrayList sa1 = split(s1);
    ArrayList sa2 = split(s2);
    // Nothing or different structure
    if (sa1.size() == 0 || sa1.size() != sa2.size())
    {
      // Just compare the original strings
      return s1.compareTo(s2);
    }
    int i = 0;
    String si1 = "";
    String si2 = "";
    // Compare beginning of string
    for (; i < sa1.size(); i++)
    {
      si1 = (String)sa1.get(i);
      si2 = (String)sa2.get(i);
      if (!si1.equals(si2))
        break;  // Until we find a difference
    }
    // No difference found?
    if (i == sa1.size())
      return 0; // Same strings!

    // Try to convert the different run of characters to number
    int val1, val2;
    try
    {
      val1 = Integer.parseInt(si1);
      val2 = Integer.parseInt(si2);
    }
    catch (NumberFormatException e)
    {
      return s1.compareTo(s2);  // Strings differ on a non-number
    }

    // Compare remainder of string
    for (i++; i < sa1.size(); i++)
    {
      si1 = (String)sa1.get(i);
      si2 = (String)sa2.get(i);
      if (!si1.equals(si2))
      {
        return s1.compareTo(s2);  // Strings differ
      }
    }

    // Here, the strings differ only on a number
    return val1 < val2 ? -1 : 1;
  }

  ArrayList split(String s)
  {
    ArrayList r = new ArrayList();
    Matcher matcher = splitter.matcher(s);
    while (matcher.find())
    {
      String m = matcher.group(1);
      r.add(m);
    }
    return r;
  }
}

Arrays.sort(strs, new InternalNumberComparator());

This algorithm need much more testing, but it seems to behave rather nicely.

[EDIT] I added some more comments to be clearer. I see there are much more answers than when I started to code this... But I hope I provided a good starting base and/or some ideas.

Ian Griffiths of Microsoft has a C# implementation he calls Natural Sorting. Porting to Java should be fairly easy, easier than from C anyway!

UPDATE: There seems to be a Java example on eekboom that does this, see the "compareNatural" and use that as your comparer to sorts.

The implementation I propose here is simple and efficient. It does not allocate any extra memory, directly or indirectly by using regular expressions or methods such as substring(), split(), toCharArray(), etc.

This implementation first goes across both strings to search for the first characters that are different, at maximal speed, without doing any special processing during this. Specific number comparison is triggered only when these characters are both digits. A side-effect of this implementation is that a digit is considered as greater than other letters, contrarily to default lexicographic order.

public static final int compareNatural (String s1, String s2)
{
   // Skip all identical characters
   int len1 = s1.length();
   int len2 = s2.length();
   int i;
   char c1, c2;
   for (i = 0, c1 = 0, c2 = 0; (i < len1) && (i < len2) && (c1 = s1.charAt(i)) == (c2 = s2.charAt(i)); i++);

   // Check end of string
   if (c1 == c2)
      return(len1 - len2);

   // Check digit in first string
   if (Character.isDigit(c1))
   {
      // Check digit only in first string 
      if (!Character.isDigit(c2))
         return(1);

      // Scan all integer digits
      int x1, x2;
      for (x1 = i + 1; (x1 < len1) && Character.isDigit(s1.charAt(x1)); x1++);
      for (x2 = i + 1; (x2 < len2) && Character.isDigit(s2.charAt(x2)); x2++);

      // Longer integer wins, first digit otherwise
      return(x2 == x1 ? c1 - c2 : x1 - x2);
   }

   // Check digit only in second string
   if (Character.isDigit(c2))
      return(-1);

   // No digits
   return(c1 - c2);
}

I realize you're in java, but you can take a look at how StrCmpLogicalW works. It's what Explorer uses to sort filenames in Windows. You can look at the WINE implementation here.

Split the string into runs of letters and numbers, so "foo 12 bar" becomes the list ("foo", 12, "bar"), then use the list as the sort key. This way the numbers will be ordered in numerical order, not alphabetical.

I came up with a quite simple implementation in Java using regular expressions:

public static Comparator<String> naturalOrdering() {
    final Pattern compile = Pattern.compile("(\\d+)|(\\D+)");
    return (s1, s2) -> {
        final Matcher matcher1 = compile.matcher(s1);
        final Matcher matcher2 = compile.matcher(s2);
        while (true) {
            final boolean found1 = matcher1.find();
            final boolean found2 = matcher2.find();
            if (!found1 || !found2) {
                return Boolean.compare(found1, found2);
            } else if (!matcher1.group().equals(matcher2.group())) {
                if (matcher1.group(1) == null || matcher2.group(1) == null) {
                    return matcher1.group().compareTo(matcher2.group());
                } else {
                    return Integer.valueOf(matcher1.group(1)).compareTo(Integer.valueOf(matcher2.group(1)));
                }
            }
        }
    };
}

Here is how it works:

final List<String> strings = Arrays.asList("x15", "xa", "y16", "x2a", "y11", "z", "z5", "x2b", "z");
strings.sort(naturalOrdering());
System.out.println(strings);

[x2a, x2b, x15, xa, y11, y16, z, z, z5]

The Alphanum algrothim is nice, but it did not match requirements for a project I'm working on. I need to be able to sort negative numbers and decimals correctly. Here is the implementation I came up. Any feedback would be much appreciated.

public class StringAsNumberComparator implements Comparator<String> {

    public static final Pattern NUMBER_PATTERN = Pattern.compile("(\\-?\\d+\\.\\d+)|(\\-?\\.\\d+)|(\\-?\\d+)");

    /**
     * Splits strings into parts sorting each instance of a number as a number if there is
     * a matching number in the other String.
     * 
     * For example A1B, A2B, A11B, A11B1, A11B2, A11B11 will be sorted in that order instead
     * of alphabetically which will sort A1B and A11B together.
     */
    public int compare(String str1, String str2) {
        if(str1 == str2) return 0;
        else if(str1 == null) return 1;
        else if(str2 == null) return -1;

        List<String> split1 = split(str1);
        List<String> split2 = split(str2);
        int diff = 0;

        for(int i = 0; diff == 0 && i < split1.size() && i < split2.size(); i++) {
            String token1 = split1.get(i);
            String token2 = split2.get(i);

            if((NUMBER_PATTERN.matcher(token1).matches() && NUMBER_PATTERN.matcher(token2).matches()) {
                diff = (int) Math.signum(Double.parseDouble(token1) - Double.parseDouble(token2));
            } else {
                diff = token1.compareToIgnoreCase(token2);
            }
        }
        if(diff != 0) {
            return diff;
        } else {
            return split1.size() - split2.size();
        }
    }

    /**
     * Splits a string into strings and number tokens.
     */
    private List<String> split(String s) {
        List<String> list = new ArrayList<String>();
        try (Scanner scanner = new Scanner(s)) {
            int index = 0;
            String num = null;
            while ((num = scanner.findInLine(NUMBER_PATTERN)) != null) {
                int indexOfNumber = s.indexOf(num, index);
                if (indexOfNumber > index) {
                    list.add(s.substring(index, indexOfNumber));
                }
                list.add(num);
                index = indexOfNumber + num.length();
            }
            if (index < s.length()) {
                list.add(s.substring(index));
            }
        }
        return list;
    }
}

PS. I wanted to use the java.lang.String.split() method and use "lookahead/lookbehind" to keep the tokens, but I could not get it to work with the regular expression I was using.

interesting problem, and here my proposed solution:

import java.util.Collections;
import java.util.Vector;

public class CompareToken implements Comparable<CompareToken>
{
    int valN;
    String valS;
    String repr;

    public String toString() {
    return repr;
    }

    public CompareToken(String s) {
    int l = 0;
    char data[] = new char[s.length()];
    repr = s;
    valN = 0;
    for (char c : s.toCharArray()) {
        if(Character.isDigit(c))
        valN = valN * 10 + (c - '0');
        else
        data[l++] = c;
    }

    valS = new String(data, 0, l);
    }

    public int compareTo(CompareToken b) {
    int r = valS.compareTo(b.valS);
    if (r != 0)
        return r;

    return valN - b.valN;
    }


    public static void main(String [] args) {
    String [] strings = {
        "aaa",
        "bbb3ccc",
        "bbb12ccc",
        "ccc 11",
        "ddd",
        "eee3dddjpeg2000eee",
        "eee12dddjpeg2000eee"
    };

    Vector<CompareToken> data = new Vector<CompareToken>();
    for(String s : strings)
        data.add(new CompareToken(s));
    Collections.shuffle(data);

    Collections.sort(data);
    for (CompareToken c : data)
        System.out.println ("" + c);
    }

}

Prior to discovering this thread, I implemented a similar solution in javascript. Perhaps my strategy will find you well, despite different syntax. Similar to above, I parse the two strings being compared, and split them both into arrays, dividing the strings at continuous numbers.

...
var regex = /(\d+)/g,
    str1Components = str1.split(regex),
    str2Components = str2.split(regex),
...

I.e., 'hello22goodbye 33' => ['hello', 22, 'goodbye ', 33]; Thus, you can walk through the arrays' elements in pairs between string1 and string2, do some type coercion (such as, is this element really a number?), and compare as you walk.

Working example here: http://jsfiddle.net/F46s6/3/

Note, I currently only support integer types, though handling decimal values wouldn't be too hard of a modification.

My 2 cents.Is working well for me. I am mainly using it for filenames.

    private final boolean isDigit(char ch)
        {
            return ch >= 48 && ch <= 57;
        }


        private int compareNumericalString(String s1,String s2){

            int s1Counter=0;
            int s2Counter=0;
            while(true){
                if(s1Counter>=s1.length()){
                    break;
                }
                if(s2Counter>=s2.length()){
                    break;
                }
                char currentChar1=s1.charAt(s1Counter++);
                char currentChar2=s2.charAt(s2Counter++);
                if(isDigit(currentChar1) &&isDigit(currentChar2)){
                    String digitString1=""+currentChar1;
                    String digitString2=""+currentChar2;
                    while(true){
                        if(s1Counter>=s1.length()){
                            break;
                        }
                        if(s2Counter>=s2.length()){
                            break;
                        }

                        if(isDigit(s1.charAt(s1Counter))){
                            digitString1+=s1.charAt(s1Counter);
                            s1Counter++;
                        }

                        if(isDigit(s2.charAt(s2Counter))){
                            digitString2+=s2.charAt(s2Counter);
                            s2Counter++;
                        }

                        if((!isDigit(s1.charAt(s1Counter))) && (!isDigit(s2.charAt(s2Counter)))){
                            currentChar1=s1.charAt(s1Counter);
                            currentChar2=s2.charAt(s2Counter);
                            break;
                        }
                    }
                    if(!digitString1.equals(digitString2)){
                        return Integer.parseInt(digitString1)-Integer.parseInt(digitString2);
                    }
                }

                if(currentChar1!=currentChar2){
                    return currentChar1-currentChar2;
                }

            }
            return s1.compareTo(s2);
        }

I think you'll have to do the comparison on a character-by-character fashion. Grab a character, if it's a number character, keep grabbing, then reassemble to characters into a single number string and convert it into an int. Repeat on the other string, and only then do the comparison.

Short answer: based on the context, I can't tell whether this is just some quick-and-dirty code for personal use, or a key part of Goldman Sachs' latest internal accounting software, so I'll open by saying: eww. That's a rather funky sorting algorithm; try to use something a bit less "twisty" if you can.

Long answer:

The two issues that immediately come to mind in your case are performance, and correctness. Informally, make sure it's fast, and make sure your algorithm is a total ordering.

(Of course, if you're not sorting more than about 100 items, you can probably disregard this paragraph.) Performance matters, as the speed of the comparator will be the largest factor in the speed of your sort (assuming the sort algorithm is "ideal" to the typical list). In your case, the comparator's speed will depend mainly on the size of the string. The strings seem to be fairly short, so they probably won't dominate as much as the size of your list.

Turning each string into a string-number-string tuple and then sorting this list of tuples, as suggested in another answer, will fail in some of your cases, since you apparently will have strings with multiple numbers appearing.

The other problem is correctness. Specifically, if the algorithm you described will ever permit A > B > ... > A, then your sort will be non-deterministic. In your case, I fear that it might, though I can't prove it. Consider some parsing cases such as:

  aa 0 aa
  aa 23aa
  aa 2a3aa
  aa 113aa
  aa 113 aa
  a 1-2 a
  a 13 a
  a 12 a
  a 2-3 a
  a 21 a
  a 2.3 a

Although the question asked a java solution, for anyone who wants a scala solution:

object Alphanum {

   private[this] val regex = "((?<=[0-9])(?=[^0-9]))|((?<=[^0-9])(?=[0-9]))"

   private[this] val alphaNum: Ordering[String] = Ordering.fromLessThan((ss1: String, ss2: String) => (ss1, ss2) match {
     case (sss1, sss2) if sss1.matches("[0-9]+") && sss2.matches("[0-9]+") => sss1.toLong < sss2.toLong
     case (sss1, sss2) => sss1 < sss2
   })

   def ordering: Ordering[String] = Ordering.fromLessThan((s1: String, s2: String) => {
     import Ordering.Implicits.infixOrderingOps
     implicit val ord: Ordering[List[String]] = Ordering.Implicits.seqDerivedOrdering(alphaNum)

     s1.split(regex).toList < s2.split(regex).toList
   })

}

My problem was that I have lists consisting of a combination of alpha numeric strings (eg C22, C3, C5 etc), alpha strings (eg A, H, R etc) and just digits (eg 99, 45 etc) that need sorting in the order A, C3, C5, C22, H, R, 45, 99. I also have duplicates that need removing so I only get a single entry.

I'm also not just working with Strings, I'm ordering an Object and using a specific field within the Object to get the correct order.

A solution that seems to work for me is :

SortedSet<Code> codeSet;
codeSet = new TreeSet<Code>(new Comparator<Code>() {

private boolean isThereAnyNumber(String a, String b) {
    return isNumber(a) || isNumber(b);
}

private boolean isNumber(String s) {
    return s.matches("[-+]?\\d*\\.?\\d+");
}

private String extractChars(String s) {
    String chars = s.replaceAll("\\d", "");
    return chars;
}

private int extractInt(String s) {
    String num = s.replaceAll("\\D", "");
    return num.isEmpty() ? 0 : Integer.parseInt(num);
}

private int compareStrings(String o1, String o2) {

    if (!extractChars(o1).equals(extractChars(o2))) {
        return o1.compareTo(o2);
    } else
        return extractInt(o1) - extractInt(o2);
}

@Override
public int compare(Code a, Code b) {

    return isThereAnyNumber(a.getPrimaryCode(), b.getPrimaryCode()) 
            ? isNumber(a.getPrimaryCode()) ? 1 : -1 
                : compareStrings(a.getPrimaryCode(), b.getPrimaryCode());
                }
            });

It 'borrows' some code that I found here on Stackoverflow plus some tweaks of my own to get it working just how I needed it too.

Due to trying to order Objects, needing a comparator as well as duplicate removal, one negative fudge I had to employ was I first have to write my Objects to a TreeMap before writing them to a Treeset. It may impact performance a little but given that the lists will be a max of about 80 Codes, it shouldn't be a problem.

I had a similar problem where my strings had space-separated segments inside. I solved it in this way:

public class StringWithNumberComparator implements Comparator<MyClass> {

@Override
public int compare(MyClass o1, MyClass o2) {
    if (o1.getStringToCompare().equals(o2.getStringToCompare())) {
        return 0;
    }
    String[] first = o1.getStringToCompare().split(" ");
    String[] second = o2.getStringToCompare().split(" ");
    if (first.length == second.length) {
        for (int i = 0; i < first.length; i++) {

            int segmentCompare = StringUtils.compare(first[i], second[i]);
            if (StringUtils.isNumeric(first[i]) && StringUtils.isNumeric(second[i])) {

                segmentCompare = NumberUtils.compare(Integer.valueOf(first[i]), Integer.valueOf(second[i]));
                if (0 != segmentCompare) {
                    // return only if uneven numbers in case there are more segments to be checked
                    return segmentCompare;
                }
            }
            if (0 != segmentCompare) {
                return segmentCompare;
            }
        }
    } else {
        return StringUtils.compare(o1.getDenominazione(), o2.getDenominazione());
    }

    return 0;
}

As you can see I have used Apaches StringUtils.compare() and NumberUtils.compere() as a standard help.

In your given example, the numbers you want to compare have spaces around them while the other numbers do not, so why would a regular expression not work?

bbb 12 ccc

vs.

eee 12 ddd jpeg2000 eee

If you're writing a comparator class, you should implement your own compare method that will compare two strings character by character. This compare method should check if you're dealing with alphabetic characters, numeric characters, or mixed types (including spaces). You'll have to define how you want a mixed type to act, whether numbers come before alphabetic characters or after, and where spaces fit in etc.

On Linux glibc provides strverscmp(), it's also available from gnulib for portability. However truly "human" sorting has lots of other quirks like "The Beatles" being sorted as "Beatles, The". There is no simple solution to this generic problem.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top