Question

I have a dictionary of characters and their positions on a page, keyed by y position (so all the characters in a row sit under a single key). The data comes from a table in a PDF, and I am trying to combine the characters in each row into words based on spacing, so that the columns end up as separate values. So this:

380.822: [[u'1', [61.2, 380.822, 65.622, 391.736]],
[u' ', [65.622, 380.822, 67.834, 391.736]],
[u'p', [81.738, 380.822, 83.503, 391.736]],
[u'i', [84.911, 380.822, 89.333, 391.736]],
[u'e', [90.741, 380.822, 95.163, 391.736]],
[u'c', [96.571, 380.822, 100.548, 391.736]],
[u'e', [100.548, 380.822, 104.97, 391.736]],
[u' ', [104.97, 380.822, 107.181, 391.736]],
[u'8', [122.81, 380.822, 127.232, 391.736]],
[u'9', [127.723, 380.822, 132.146, 391.736]],
[u'0', [132.636, 380.822, 137.059, 391.736]],
[u'1', [137.55, 380.822, 141.972, 391.736]],
[u'S', [142.463, 380.822, 146.885, 391.736]],
[u'Y', [147.376, 380.822, 152.681, 391.736]],
[u'R', [153.172, 380.822, 157.595, 391.736]],
[u'8', [157.595, 380.822, 162.017, 391.736]]]

would become this:

380.822: [[u'1 ', [61.2, 380.822, 67.834, 391.736]],
[u'piece ', [81.738, 380.822, 107.181, 391.736]],
[u'8901SYR8', [122.81, 380.822, 162.017, 391.736]]]

I thought I could iterate through the values for each key, merge the text and coordinates whenever the gap was below some threshold, and then delete the value that got merged, but deleting would throw off the iteration. Every alternative I come up with is really clunky, such as marking the leftovers from merges with a character for later deletion, but then my function started merging those as well.

Thanks

@Lattyware, thanks again for your help. I tried implementing your suggestions and they mostly work, but I don't think I fully grasp how groupby works. Specifically, why did your example not merge without a group change, while my modified version does (such as the merge after the first 8 in 8901SYR8)? As a result, some of my lines split the first letter of a string from the rest:

{380.822: [
  (u'1 ', [61.2, 380.822, 65.622, 391.736]),
  (u'p', [81.738, 380.822, 83.503, 391.736]),
  (u'iece ', [84.911, 380.822, 89.333, 391.736]),
  (u'8', [122.81, 380.822, 127.232, 391.736]),
  (u'901SYR8 ', [127.723, 380.822, 132.146, 391.736]),
  (u'M', [172.239, 380.822, 178.864, 391.736]),
  (u'ultipurpose Aluminum (Alloy 6061) .125" Thick Sheet, 12"'...]}

The adaptations I made are:

xtol=7

def xDist(rCur,rPrv):
    if rPrv is None: return False
    else: return not rCur[1][0]-rPrv[1][2] < xtol

def split(row):
    ret = xDist(row, split.previous)
    print "split",ret,row,split.previous
    split.previous = row
    return ret
split.previous = None

def merge(group):
    letters, position_groups = zip(*group)
    return "".join(letters), next(iter(position_groups))

def group(value):
    return [merge(group) for isspace, group in
            itertools.groupby(value, key=split)]

print({key: group(value) for key, value in old.items()})

and the print output is:

...
split False [u'9', [127.723, 380.822, 132.146, 391.736]] [u'8', [122.81, 380.822, 127.232, 391.736]]
merge (u'8',) ([122.81, 380.822, 127.232, 391.736],)
split False [u'0', [132.636, 380.822, 137.059, 391.736]] [u'9', [127.723, 380.822, 132.146, 391.736]]
split False [u'1', [137.55, 380.822, 141.972, 391.736]] [u'0', [132.636, 380.822, 137.059, 391.736]]
split False [u'S', [142.463, 380.822, 146.885, 391.736]] [u'1', [137.55, 380.822, 141.972, 391.736]]
split False [u'Y', [147.376, 380.822, 152.681, 391.736]] [u'S', [142.463, 380.822, 146.885, 391.736]]
split False [u'R', [153.172, 380.822, 157.595, 391.736]] [u'Y', [147.376, 380.822, 152.681, 391.736]]
split False [u'8', [157.595, 380.822, 162.017, 391.736]] [u'R', [153.172, 380.822, 157.595, 391.736]]
split False [u' ', [162.017, 380.822, 164.228, 391.736]] [u'8', [157.595, 380.822, 162.017, 391.736]]
split True [u'M', [172.239, 380.822, 178.864, 391.736]] [u' ', [162.017, 380.822, 164.228, 391.736]]
merge (u'9', u'0', u'1', u'S', u'Y', u'R', u'8', u' ') ([127.723, 380.822, 132.146, 391.736], [132.636, 380.822, 137.059, 391.736], [137.55, 380.822, 141.972, 391.736], [142.463, 380.822, 146.885, 391.736], [147.376, 380.822, 152.681, 391.736], [153.172, 380.822, 157.595, 391.736], [157.595, 380.822, 162.017, 391.736], [162.017, 380.822, 164.228, 391.736])
split False [u'u', [179.292, 380.822, 183.714, 391.736]] [u'M', [172.239, 380.822, 178.864, 391.736]]
merge (u'M',) ([172.239, 380.822, 178.864, 391.736],)
split False [u'l', [184.142, 380.822, 185.908, 391.736]] [u'u', [179.292, 380.822, 183.714, 391.736]]

Solution

The trick is to build up a new dictionary (and inner lists), rather than trying to modify the old one. The itertools module provides what you need:

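import itertools

new = {}
for key, value in old.items():
    values = []
    # groupby yields runs of consecutive items that share the same key:
    # here, runs of spaces (True) alternating with runs of non-spaces (False).
    for isspace, group in itertools.groupby(value, key=lambda x: x[0] == " "):
        if not isspace:
            letters, coords = zip(*group)
            # Join the letters; keep only the first character's coordinates.
            values.append(("".join(letters), next(iter(coords))))
    new[key] = values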

Here I just take the coordinates of the first character in each group, but you could merge the values however you want.
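If you want the merged box instead of the first character's, one option is to take the minimum and maximum over the group. This is only a minimal sketch, assuming each position list is [x0, y0, x1, y1] (which is how the sample data looks); merge_boxes is a name made up for illustration.

```python
def merge_boxes(boxes):
    """Return the box [x0, y0, x1, y1] that encloses every box in `boxes`.

    Assumes each box is laid out as [x0, y0, x1, y1].
    """
    x0s, y0s, x1s, y1s = zip(*boxes)
    return [min(x0s), min(y0s), max(x1s), max(y1s)]


# Positions of the letters in "piece" from the sample row in the question.
piece = [
    [81.738, 380.822, 83.503, 391.736],
    [84.911, 380.822, 89.333, 391.736],
    [90.741, 380.822, 95.163, 391.736],
    [96.571, 380.822, 100.548, 391.736],
    [100.548, 380.822, 104.97, 391.736],
]

print(merge_boxes(piece))  # [81.738, 380.822, 104.97, 391.736]
```

In the loop above, this would take the place of next(iter(coords)).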

Edit: Split into functions for readability, using list/dict comprehensions:

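import itertools

def split(row):
    # Key function: True for a space character, False otherwise.
    character, positions = row
    return character == " "

def merge(group):
    # Join the letters of a group; keep the first character's position.
    letters, position_groups = zip(*group)
    return "".join(letters), next(iter(position_groups))

def group(value):
    # Drop the groups made up of spaces, merge the rest.
    return [merge(group) for isspace, group in
            itertools.groupby(value, key=split) if not isspace]

print({key: group(value) for key, value in old.items()})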

Giving:

{380.822: [
    ('1', [61.2, 380.822, 65.622, 391.736]), 
    ('piece', [81.738, 380.822, 83.503, 391.736]), 
    ('8901SYR8', [122.81, 380.822, 127.232, 391.736])
]}

Edit:

You mention in your comment that you want to use the previous value to compute the grouping. This can be done in many ways, but one of the most lightweight is a function attribute, e.g.:

def split(row):
    ret = some_computation(row, split.previous)
    split.previous = row
    return ret
split.previous = None

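Note also that once the key is based on distance rather than on whether the character is a space, you probably don't want the if not isspace filter from my example, since the key no longer marks groups of spaces to discard.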
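On the follow-up about first letters being split off: groupby starts a new group every time the key value changes. A key that is True only for the first character after a gap and False for everything else therefore puts that first character in a group of its own. One way around this is to return a counter that goes up by one at each gap, so every character of a word shares the same key. The sketch below assumes positions are [x0, y0, x1, y1] and reuses the threshold of 7 from the question; GapKey, group_row and X_TOL are names made up for illustration, and a small callable class stands in for the function attribute so the state starts fresh for each row.

```python
import itertools

X_TOL = 7  # assumed gap threshold, same value as xtol in the question


class GapKey(object):
    """groupby key: a counter that increases at every gap of at least `tol`."""

    def __init__(self, tol):
        self.tol = tol
        self.previous = None  # box of the previous character, if any
        self.group_id = 0

    def __call__(self, row):
        _, box = row
        # Gap = left edge of this character minus right edge of the previous.
        if self.previous is not None and box[0] - self.previous[2] >= self.tol:
            self.group_id += 1
        self.previous = box
        return self.group_id


def merge(group):
    """Join the letters and take the box enclosing the whole group."""
    letters, boxes = zip(*group)
    x0s, y0s, x1s, y1s = zip(*boxes)
    return "".join(letters), [min(x0s), min(y0s), max(x1s), max(y1s)]


def group_row(row, tol=X_TOL):
    key = GapKey(tol)  # fresh state for every row
    return [merge(g) for _, g in itertools.groupby(row, key=key)]


# The sample row from the question.
row = [
    ["1", [61.2, 380.822, 65.622, 391.736]],
    [" ", [65.622, 380.822, 67.834, 391.736]],
    ["p", [81.738, 380.822, 83.503, 391.736]],
    ["i", [84.911, 380.822, 89.333, 391.736]],
    ["e", [90.741, 380.822, 95.163, 391.736]],
    ["c", [96.571, 380.822, 100.548, 391.736]],
    ["e", [100.548, 380.822, 104.97, 391.736]],
    [" ", [104.97, 380.822, 107.181, 391.736]],
    ["8", [122.81, 380.822, 127.232, 391.736]],
    ["9", [127.723, 380.822, 132.146, 391.736]],
    ["0", [132.636, 380.822, 137.059, 391.736]],
    ["1", [137.55, 380.822, 141.972, 391.736]],
    ["S", [142.463, 380.822, 146.885, 391.736]],
    ["Y", [147.376, 380.822, 152.681, 391.736]],
    ["R", [153.172, 380.822, 157.595, 391.736]],
    ["8", [157.595, 380.822, 162.017, 391.736]],
]

# Three groups: '1 ', 'piece ' and '8901SYR8', each with its enclosing box.
print(group_row(row))
```

For the whole dictionary this would be something like {key: group_row(value) for key, value in old.items()}. The threshold will likely need tuning for other fonts or column spacings.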

Licensed under: CC-BY-SA with attribution