Is there an algorithm to find the smallest set of the shortest prefix substrings of a continuous numeric sequence?

https://cs.stackexchange.com/questions/124480

29-09-2020
|

Question

Before anything I want to preemptively thank anyone who drops by for their patience, I don't have any formal CS background so I'm probably going to use some of these terms wrong.

I have a puzzle: Given two numbers which define a set of continuous counting numbers of the same number of digits, between roughly 5 and 12 digits long (IE 50000 and 60000, 32325600000 and 32399999999), what's the fastest and most efficient way to condense this down to a set of prefixes which "contain" all permutations of subsequent digits?

The approach we've been using is a hybrid of treating these as numbers and character strings. First remove any pairs of matched 0's and 9's at the end of the start/end. Second create the full sequence copied to two columns, where the second column is always a substring with the rightmost digit removed relative to the first column. From there I can recursively count how many times any given one-digit-shorter substring occurs, keep the items where N-count<10, and where N-count>=10 remove another digit from both columns and repeat.

What I'm wondering is if there's a quicker and more efficient way to do this. String operations instead of math was an obvious quick fix, but the general approach still relies on recursively grouping and chopping off characters. I've considered making a full series of Prefix and N-count columns going back to the highest digit but at least instinctively that feels like it would be less efficient than operating recursively on a decreasing pool of numbers.

IE
Input: 
Start=50000000 
End=55399999

which becomes
Start=500 
End=553

Cycle one creates two sequence columns like this:

String   Prefix     N-Count
500        50          10
501        50          10
etc..                  
510        51          10
etc..
550        55          6
551        55          6
552        55          6
553        55          6   

Cycle two keeps everything where N-count<10 the same, but reduces the rest by 1
digit each and recalculates N-count (while getting rid of duplicates).

String   Prefix     N-Count
50        5          5
51        5          5
52        5          5         
53        5          5
54        5          5       
550       55         4
551       55         4
552       55         4
553       55         4  


Output:  50,51,52,53,54,55,550,551,552,553 
```

Solution

Suppose the input is $a,b$, two $n$-digit long numbers. We allow leading zeroes (we will see in a moment why). Let $c$ be the longest common prefix of $a,b$, and let $a=cA$, $b=cB$.

If $A = 0^{n-|c|}$ and $B = 9^{n-|c|}$ then we simply output $c$ (this includes the case $|c|=n$).

Otherwise, let $d_A$ be the first digit of $A$, and let $d_B$ be the first digit of $B$.

Recursively find a solution for the ranges $[A,d_A 9^{|A|-1}]$ and $[d_B 0^{|B|-1},B]$, and prefix $c$ to everything. Also, add $c(d_A+1),\ldots,c(d_B-1)$.

Here is an unoptimized python implementation:

def prefixes(a,b,C=''):
     n, m = len(a), max(i for i in range(len(a)+1) if a[:i] == b[:i])
     c, A, B = C+a[:m], a[m:], b[m:]
     if A == '0'*len(A) and B == '9'*len(B):
         yield c
     else:
         yield from prefixes(A[1:],'9'*(len(A)-1),c+A[0])
         for i in range(int(A[0])+1,int(B[0])):
             yield f'{c}{i}'
         yield from prefixes('0'*(len(B)-1),B[1:],c+B[0])

For example, if you run list(prefixes('50000000','55399999')) then you get the following output: ['50', '51', '52', '53', '54', '550', '551', '552', '553']

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange