Finding numerical substrings mathematically, without string comparison

https://stackoverflow.com/questions/231917

04-07-2019
|

Question

This originally was a problem I ran into at work, but is now something I'm just trying to solve for my own curiosity.

I want to find out if int 'a' contains the int 'b' in the most efficient way possible. I wrote some code, but it seems no matter what I write, parsing it into a string and then using indexOf is twice as fast as doing it mathematically.

Memory is not an issue (within reason), just sheer processing speed.

This is the code I have written to do it mathematically:

private static int[] exponents = {10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000 };

private static boolean findMatch(int a, int b) {
    if (b > a) return false;

    if (a == b) return true;

    int needleLength = getLength(b);

    int exponent = exponents[needleLength];
    int subNum;
    while (a >= 1) {
        subNum = a % exponent;

        if (subNum == b)
            return true;

        a /= 10;
    }
    return false;
}

private static int getLength(int b) {

    int len = 0;

    while (b >= 1) {
        len++;
        b /= 10;
    }

    return len;
}

Here's the string method I'm using, which seems to trump the mathematical method above:

private static boolean findStringMatch(int a, int b) {      
    return String.valueOf(a).indexOf(String.valueOf(b)) != -1;      
}

So although this isn't really required for me to complete my work, I was just wondering if anyone could think of any way to further optimize my way of doing it mathematically, or an entirely new approach altogether. Again memory is no problem, I am just shooting for sheer speed.

I'm really interested to see or hear anything anyone has to offer on this.

EDIT: When I say contains I mean can be anywhere, so for example, findMatch(1234, 23) == true

EDIT: For everyone saying that this crap is unreadable and unnecessary: you're missing the point. The point was to get to geek out on an interesting problem, not come up with an answer to be used in production code.

Solution

This is along Kibbee's line, but I got a little intrigued by this before he posted and worked this out:

long mask ( long n ) { 
    long m   = n % 10;
    long n_d = n;
    long div = 10;
    int  shl = 0;
    while ( n_d >= 10 ) { 
        n_d /= 10;
        long t = n_d % 10;
        m |= ( t << ( shl += 4 ));
    }
    return m;
}

boolean findMatch( int a, int b ) { 
    if ( b < a  ) return false;
    if ( a == b ) return true;

    long m_a = mask( a );    // set up mask O(n)
    long m_b = mask( b );    // set up mask O(m)

    while ( m_a < m_b ) {
        if (( m_a & m_b ) == m_a ) return true;
        m_a <<= 4;  // shift - fast!
        if ( m_a == m_b ) return true;
    }  // O(p)
    return false;
}       

void testContains( int a, int b ) { 
    print( "findMatch( " + a + ", " + b + " )=" + findMatch( a, b ));
}

testContains( 12, 120 );
testContains( 12, 125 );
testContains( 123, 551241238 );
testContains( 131, 1214124 );
testContains( 131, 1314124 );

Since 300 characters is far too little to make an argument in, I'm editing this main post to respond to Pyrolistical.

Unlike the OP, I wasn't that surprised that a native compiled indexOf was faster than Java code with primitives. So my goal was not to find something I thought was faster than a native method called zillions of times all over Java code.

The OP made it clear that this was not a production problem and more along the lines of an idle curiosity, so my answer solves that curiosity. My guess was that speed was an issue, when he was trying to solve it in production, but as an idle curiosity, "This method will be called millions and millions of times" no longer applies. As he had to explain to one poster, it's no longer pursued as production code, so the complexity no longer matters.

Plus it provides the only implementation on the page that manages to find the "123" in "551241238", so unless correctness is an extraneous concern, it provides that. Also the solution space of "an algorithm that solves the problem mathematically using Java primitives but beats optimized native code" might be EMPTY.

Plus, it's not clear from your comment whether or not you compared apples to apples. The functional spec is f( int, int )-> boolean, not f( String, String )-> boolean (which is kind of the domain of indexOf) . So unless you tested something like this (which could still beat mine, and I wouldn't be awfully surprised.) the additional overhead might eat up some of that excess 40%.

boolean findMatch( int a, int b ) { 
    String s_a = "" + a;
    String s_b = "" + b;
    return s_a.indexOf( s_b ) > -1;
}

It does the same basic steps. log₁₀( a ) encoding + log₁₀( b ) encoding + actually finding the match, which is as well O(n) where n is the largest logarithm.

OTHER TIPS

It should be faster string way, because your problem is textual, not mathematical. Notice that the your "contains" relationship says nothing about the numbers, it only says something about their decimal representations.

Notice also that the function you want to write will be unreadable - another developer will never understand what you are doing. (See what trouble you had with that here.) The string version, on the other hand, is perfectly clear.

The only optimization that I can think of is to do the conversion to string on your own and compare digits (right to left) as you do the conversion. First convert all the digits of b, then convert from the right on a until you find a match on the first digit of b (from right). Compare until all of b matches or you hit a mismatch. If you hit a mismatch, backtrack to the point where you starting matching the first digit of b, advance in a and start over.

IndexOf will have to do basically the same back tracking algorithm, except from the left. Depending on the actual numbers this may be faster. I think if the numbers are random, it should be since there should be many times when it doesn't have to convert all of a.

Looks like your function is actually doing pretty well, but an small improvement:

private static boolean findMatch(int a, int b) {
        if (b > a) return false;

        if (a == b) return true;

        int needleLength = getLength(b);

        int exponent = exponents[needleLength];
        int subNum;
        while (a > b) {
                subNum = a % exponent;

                if (subNum == b)
                        return true;

                a /= 10;
        }
        return false;
}

Just because once that a is smaller than b, is not worthy keeps looking, isnt it? Good luck and post if you find the solution!

This is an interesting problem. Many of String.class's functions are actually native making beating String a difficult proposition. But here's some helpers:

TIP 1: Different simple integer operations have different speeds.

By quick calculations in sample programs showed:

% ~ T
* ~ 4T
/ ~ 7T

So you want to use as little division as possible in favor of multiplication or modulo. Not shown are subtraction, addition, and comparison operators cause they blow all of these out of the water. Also, using "final" as much as possible allows the JVM to do certain optimizations. Speeding up you "getLength" function:

private static int getLength(final int b) {        
   int len = 0;
   while (b > exponents[len]) {
       len++;
   }
   return len + 1
}

That gives about a 7x improvement in the function. You get an indexOutOfBounds exception if b > your max in exponents. To solve that, you can have:

private static int getLength(final int b) {        
   int len = 0;
   final int maxLen = exponents.length;
   while (len < maxLen && b > exponents[len]) {
       len++;
   }
   return len + 1;
}

That's slightly slower and gives you an incorrect length if b is too big, but it doesn't throw an exception.

TIP 2: Unnecessary object/primitive creation and method calls add to run time.

I'm guessing that "getLength" isn't called anywhere else, so while it might be nice to have a separate function, from a optimization standpoint its an unnecessary method call and creation of the object "len". We can put that code right where we use it.

private static boolean findMatch(int a, final int b) {
        if (b > a) return false;
        if (a == b) return true;
        int needleLength = 0;
        while (b > exponents[len]) {
            needleLength ++;
        }
        needleLength++;

        final int exponent = exponents[needleLength];
        int subNum;
        while (a >= 1 && a <= b) {
                subNum = a % exponent;
                if (subNum == b)
                        return true;
                a /= 10;
        }
        return false;
}

Also, note I changed the bottom while loop to also include "a <= b". I haven't tested that and not sure if the per-iteration penalty beats the fact that you don't waste any iterations. I'm sure there's a way to get rid of the division using clever math, but I can't think of it right now.

Umm, I'm probably totally misunderstanding the question, but.....

// Check if A is inside B lol
bool Contains (int a, int b)
{
    return (a <= b);
}

Unless you want to know if a particular sequence of numbers is within another sequence of numbers.

In that case, converting it to a string WILL be faster than doing the math to figure it out.

This in no way answers your question, whatsoever, but it's advice anyway :-)

The method name findMatch is not very descriptive. In this case, I'd have a static method ContainerBuilder.number(int), which returned a ContainerBuilder, which has the method contains on it. In this way your code becomes:

boolean b = number(12345).contains(234);

Juts some advice for the long run!

Oh yes, I meant to say also, you should define what you mean by "contains"

Is there any way to calculate this in binary? Obviously the binary value of an integer containing the binary integer of another character doesn't mean that the decical does the same. However, is there some kind of binary trickary that could be used? Maybe convert a numer like 12345 to 0001 0010 0011 0100 0101, and then do some bit shifting to figure out if 23 (0010 0011) is contained in there. Because your character set is only 10 characters, you could cut down the computation time by store 2 characters values in a single byte.

EDIT

Expanding on this idea a bit. if you have 2 integers, A and B, and want to know if A contains B, you check 2 things first. if A is less than B, then A cannot contain B. If A = B then A contains B. At this point you can convert them to strings*. If A contains the same number of character numbers as B, then A does not contain B, unless they are equal, but we wouldn't be here if they are equal, so if both strings are the same length, a does not contain b. At this point, the length of A will be longer than B. So, now you can convert the strings to their packed binary values as I noted in the first part of this post. Store these values in an array of integers. Now you do a bitwise AND Of the integer values in your array, and if the result is A, then A contains B. Now you shift the array of integers for B, to the left 4 bits, and do the conparison again. Do this until you start popping bits off the left of B.

*That * in the previous paragraph means you may be able to skip this step. There may be a way to do this without using strings at all. There might be some fancy binary trick you can do to get the packed binary representation I discussed in the first paragraph. There should be some binary trick you can use, or some quick math which will convert an integer to the decimal value I discussed before.

Can I ask where you're using this function in your code? Maybe there's another way to solve the problem it is currently solving which would be much faster. This could be like when my friend asked me to completely re-tune his guitar, and I did it before realizing I could have just lowered the bottom string by a whole step and gotten an equivalent result.

FYI

http://refactormycode.com/

Could work for you.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow