I have a dictionary of characters and their positions on a page, keyed by y position, so all characters in a row sit under a single key. The data comes from a table in a PDF, and I am trying to combine the characters in each row into words based on spacing, so that each column ends up as a separate value. So this:
380.822: [[u'1', [61.2, 380.822, 65.622, 391.736]],
[u' ', [65.622, 380.822, 67.834, 391.736]],
[u'p', [81.738, 380.822, 83.503, 391.736]],
[u'i', [84.911, 380.822, 89.333, 391.736]],
[u'e', [90.741, 380.822, 95.163, 391.736]],
[u'c', [96.571, 380.822, 100.548, 391.736]],
[u'e', [100.548, 380.822, 104.97, 391.736]],
[u' ', [104.97, 380.822, 107.181, 391.736]],
[u'8', [122.81, 380.822, 127.232, 391.736]],
[u'9', [127.723, 380.822, 132.146, 391.736]],
[u'0', [132.636, 380.822, 137.059, 391.736]],
[u'1', [137.55, 380.822, 141.972, 391.736]],
[u'S', [142.463, 380.822, 146.885, 391.736]],
[u'Y', [147.376, 380.822, 152.681, 391.736]],
[u'R', [153.172, 380.822, 157.595, 391.736]],
[u'8', [157.595, 380.822, 162.017, 391.736]]]
would become this:
380.822: [[u'1 ', [61.2, 380.822, 67.834, 391.736]],
[u'piece ', [81.738, 380.822, 107.181, 391.736]],
[u'8901SYR8', [122.81, 380.822, 162.017, 391.736]]]
I thought I could iterate through the values for each key, merge the text and coordinates whenever the gap was smaller than some threshold, and then delete the entry that got merged, but deleting would throw off the iteration. Every alternative I come up with is really clunky. For example, I tried marking the leftovers from merges with a character to flag them for later deletion, but then my function started merging those as well.
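One way to sidestep the delete-while-iterating problem is to build a new list rather than editing the old one: keep the word being assembled at the end of the output list, extend it while the gap to the next character is small, and start a new word otherwise. Below is a minimal sketch of that idea, assuming each entry has the form [text, [x0, y0, x1, y1]] as in the data above. The name merge_row and the max_gap default of 7 are made up for illustration.

```python
def merge_row(chars, max_gap=7.0):
    """Merge [text, [x0, y0, x1, y1]] entries into words, left to right.

    A new word starts whenever the horizontal gap between the previous
    word's right edge (x1) and the next character's left edge (x0)
    exceeds max_gap. The input list is left untouched.
    """
    words = []
    for text, (x0, y0, x1, y1) in chars:
        if words and x0 - words[-1][1][2] <= max_gap:
            # Close enough: extend the current word's text and box.
            prev_text, box = words[-1]
            words[-1] = [prev_text + text,
                         [box[0], min(box[1], y0),
                          max(box[2], x1), max(box[3], y1)]]
        else:
            # Gap too wide (or first character): start a new word.
            words.append([text, [x0, y0, x1, y1]])
    return words
```

Applied per row, this would look like new = {y: merge_row(row) for y, row in old.items()}, where old is the dictionary described above. Nothing is deleted during the loop, so no markers for later cleanup are needed.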
Thanks
@Lattyware, thanks again for your help. I tried implementing your suggestions and they are mostly working, but I don't think I fully grasp how groupby works. Specifically, why did your example not merge without a group change, while my modified version does (such as the merge after the 8 in 8901SYR8)? The result in my code is that some of my lines split the first letter of the string from the rest:
{380.822: [
(u'1 ', [61.2, 380.822, 65.622, 391.736]),
(u'p', [81.738, 380.822, 83.503, 391.736]),
(u'iece ', [84.911, 380.822, 89.333, 391.736]),
(u'8', [122.81, 380.822, 127.232, 391.736]),
(u'901SYR8 ', [127.723, 380.822, 132.146, 391.736]),
(u'M', [172.239, 380.822, 178.864, 391.736]),
(u'ultipurpose Aluminum (Alloy 6061) .125" Thick Sheet, 12"'...]}
The adaptations I made are:
xtol=7
def xDist(rCur,rPrv):
    if rPrv == None: output=False
    else: return not rCur[1][0]-rPrv[1][2] < xtol
def split(row):
    ret = xDist(row, split.previous)
    print "split",ret,row,split.previous
    split.previous = row
    return ret
split.previous = None
def merge(group):
    letters, position_groups = zip(*group)
    return "".join(letters), next(iter(position_groups))
def group(value):
    return [merge(group) for isspace, group in
            itertools.groupby(value, key=split)]
print({key: group(value) for key, value in old.items()})
and the print output is:
...
split False [u'9', [127.723, 380.822, 132.146, 391.736]] [u'8', [122.81, 380.822, 127.232, 391.736]]
merge (u'8',) ([122.81, 380.822, 127.232, 391.736],)
split False [u'0', [132.636, 380.822, 137.059, 391.736]] [u'9', [127.723, 380.822, 132.146, 391.736]]
split False [u'1', [137.55, 380.822, 141.972, 391.736]] [u'0', [132.636, 380.822, 137.059, 391.736]]
split False [u'5', [142.463, 380.822, 146.885, 391.736]] [u'1', [137.55, 380.822, 141.972, 391.736]]
split False [u'K', [147.376, 380.822, 152.681, 391.736]] [u'5', [142.463, 380.822, 146.885, 391.736]]
split False [u'2', [153.172, 380.822, 157.595, 391.736]] [u'K', [147.376, 380.822, 152.681, 391.736]]
split False [u'8', [157.595, 380.822, 162.017, 391.736]] [u'2', [153.172, 380.822, 157.595, 391.736]]
split False [u' ', [162.017, 380.822, 164.228, 391.736]] [u'8', [157.595, 380.822, 162.017, 391.736]]
split True [u'M', [172.239, 380.822, 178.864, 391.736]] [u' ', [162.017, 380.822, 164.228, 391.736]]
merge (u'9', u'0', u'1', u'S', u'Y', u'R', u'8', u' ') ([127.723, 380.822, 132.146, 391.736], [132.636, 380.822, 137.059, 391.736], [137.55, 380.822, 141.972, 391.736], [142.463, 380.822, 146.885, 391.736], [147.376, 380.822, 152.681, 391.736], [153.172, 380.822, 157.595, 391.736], [157.595, 380.822, 162.017, 391.736], [162.017, 380.822, 164.228, 391.736])
split False [u'u', [179.292, 380.822, 183.714, 391.736]] [u'M', [172.239, 380.822, 178.864, 391.736]]
merge (u'M',) ([172.239, 380.822, 178.864, 391.736],)
split False [u'l', [184.142, 380.822, 185.908, 391.736]] [u'u', [179.292, 380.822, 183.714, 391.736]]
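A possible reading of this output: itertools.groupby starts a new group every time the key value changes between consecutive items. With a boolean key that is True only for the first character after a gap, that character forms a group of its own (key True) and the rest of the word forms the next group (key False), which would account for 'M' being separated from 'ultipurpose'. One way around it is a key that stays constant within a word, such as a running word index. The sketch below assumes the same [text, [x0, y0, x1, y1]] layout; group_row, word_index, and XTOL are illustrative names, not part of the suggestion being discussed.

```python
import itertools

XTOL = 7.0  # assumed gap threshold, same value as xtol above


def group_row(chars, xtol=XTOL):
    """Group [text, [x0, y0, x1, y1]] entries into words using groupby."""
    # State is local to this call, so it starts fresh for every row.
    state = {"index": 0, "prev": None}

    def word_index(item):
        # Bump the index whenever the gap to the previous character is
        # too wide; all characters of one word then share the same key.
        prev = state["prev"]
        if prev is not None and item[1][0] - prev[1][2] >= xtol:
            state["index"] += 1
        state["prev"] = item
        return state["index"]

    words = []
    for _, members in itertools.groupby(chars, key=word_index):
        members = list(members)
        text = "".join(ch for ch, _ in members)
        first, last = members[0][1], members[-1][1]
        # The merged box runs from the first character's left edge
        # to the last character's right edge.
        words.append((text, [first[0], first[1], last[2], last[3]]))
    return words
```

Two further differences from the code above are worth noting. The merged box takes its right edge from the last character rather than reusing the first character's box, and the previous character is not carried over from one row to the next, as it would be with a function attribute like split.previous.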