Question

I am using a OCR algorithm (tesseract based) which has difficulties with recognizing certain characters. I have partially solved that by creating my own "post-processing hash-table" which includes pairs of characters. For example, since the text is just numbers, I have figured out that if there is Q character inside the text, it should be 9 instead.

However I have a more serious problem with 6 and 8 characters since both of them are recognized as B. Now since I know what I am looking for (when I am translating the image to text) and the strings are fairly short (6~8 digits), I thought to create strings with all possible combinations of 6 and 8 and compare each one of them to the one I am looking for.

So for example, I have the following string recognized by the OCR:

L0B7B0B5

So each B here can be 6 or 8.

Now I want to generate a list like the below:

L0878085
L0878065
L0876085
L0876065
.
.

So it's kind of binary table with 3 digits and in this case there are 8 options. But the amount of B characters in string can be other than 3 (it can be any number).

I have tried to use Python itertools module with something like that:

list(itertools.product(*["86"] * 3))

Which will provide the following result:

[('8', '8', '8'), ('8', '8', '6'), ('8', '6', '8'), ('8', '6', '6'), ('6', '8', '8'), ('6', '8', '6'), ('6', '6', '8'), ('6', '6', '6')]

which I assume I can then later use to swap B characters. However, for some reason I can't make itertools work in my environment. I assume it has something to do the fact I am using Jython and not pure Python.

I will be happy to hear any other ideas as how to complete this task. Maybe there is a simpler solution I didn't think of?

Was it helpful?

Solution

itertools.product accepts a repeat keyword that you can use:

In [92]: from itertools import product

In [93]: word = "L0B7B0B5"

In [94]: subs = product("68", repeat=word.count("B"))

In [95]: list(subs)
Out[95]: 
[('6', '6', '6'),
 ('6', '6', '8'),
 ('6', '8', '6'),
 ('6', '8', '8'),
 ('8', '6', '6'),
 ('8', '6', '8'),
 ('8', '8', '6'),
 ('8', '8', '8')]

Then one fairly concise method to make the substitutions is to do a reduction operation with the string replace method:

In [97]: subs = product("68", repeat=word.count("B"))

In [98]: [reduce(lambda s, c: s.replace('B', c, 1), sub, word) for sub in subs]
Out[98]: 
['L0676065',
 'L0676085',
 'L0678065',
 'L0678085',
 'L0876065',
 'L0876085',
 'L0878065',
 'L0878085']

Another method, using a couple more functions from itertools:

In [90]: from itertools import chain, izip_longest

In [91]: subs = product("68", repeat=word.count("B"))

In [92]: [''.join(chain(*izip_longest(word.split('B'), sub, fillvalue=''))) for sub in subs]
Out[92]: 
['L0676065',
 'L0676085',
 'L0678065',
 'L0678085',
 'L0876065',
 'L0876085',
 'L0878065',
 'L0878085']

OTHER TIPS

Here simple recursive function for generating your strings : - (It is a pseudo code)

permut(char[] original,char buff[],int i) {


 if(i<original.length) {

      if(original[i]=='B') {

          buff[i] = '6'
          permut(original,buff,i+1)
          buff[i] = '8'
          permut(original,buff,i+1)
      } 

     else if(original[i]=='Q') {

          buff[i] = '9'
          permut(original,buff,i+1)
      }

      else {

          buff[i] = ch[i];
          permut(original,buff,i+1) 
      }
 }

 else {

      store buff[]     

  }


}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top