I didn't make a single change to your parser, but made a few changes to your post-parsing code.
You are not really getting "duplicates"; the issue is that you print out the current patient data every time you see a Gleason score, and some of your patient data records include multiple Gleason score entries. If I understand what you are trying to do, here is the pseudo-code I would follow:
    accumulator = None
    foreach match in (patientDataExpr | gleasonScoreExpr).searchString(source):
        if it's a patientDataExpr:
            if accumulator is not None:
                # we are starting a new patient data record, print out the previous one
                print out accumulated data
            initialize new accumulator with current match and empty list for gleason data
        else if it's a gleasonScoreExpr:
            add this expression into the current accumulator
    # done with the for loop, do one last printout of the accumulated data
    if accumulator is not None:
        print out accumulated data
This converts to Python pretty easily:
    def printOut(patientDataTuple):
        pd, gleasonList = patientDataTuple
        print("['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]".format(
            pd, ','.join(''.join(gl.rhs) for gl in gleasonList)))

    accumPatientData = None
    for match in partMatch.searchString(TEXT):
        if match.patientData:
            if accumPatientData is not None:
                # this is a new patient data, print out the accumulated
                # Gleason scores for the previous one
                printOut(accumPatientData)
            # start accumulating for a new patient data entry
            accumPatientData = (match.patientData, [])
        elif match.gleason:
            accumPatientData[1].append(match.gleason)
        #~ print match.dump()

    if accumPatientData is not None:
        printOut(accumPatientData)
I don't think I'm dumping out the Gleason data correctly, but you can tune it from here, I think.
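If you want to check the accumulator logic on its own, without the parser, here is a minimal sketch. It does not use pyparsing at all: the `("patient", ...)` and `("gleason", ...)` tuples, the sample values, and the `group_by_patient` name are all made up for illustration and stand in for your real matches. Unlike the code above, it collects the records in a list instead of printing them, and it skips any Gleason entry that shows up before the first patient record; that second point is an assumption about what you would want.

```python
def group_by_patient(events):
    """Group a flat stream of ('patient', x) / ('gleason', y) events."""
    records = []
    current = None  # (patient_data, [gleason entries]) for the record in progress
    for kind, value in events:
        if kind == "patient":
            if current is not None:
                # a new patient record starts, so the previous one is complete
                records.append(current)
            current = (value, [])
        elif kind == "gleason" and current is not None:
            current[1].append(value)
    # the last record is never followed by another patient, so flush it here
    if current is not None:
        records.append(current)
    return records


# made-up sample data standing in for parser matches
events = [
    ("patient", "A-001"),
    ("gleason", "3+4=7"),
    ("gleason", "4+3=7"),
    ("patient", "A-002"),
    ("gleason", "3+3=6"),
]

for patient, scores in group_by_patient(events):
    print(patient, scores)
```

Each patient comes out exactly once, with all of its Gleason entries attached, which is the behavior the loop above is aiming for.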
EDIT:
You can attach diceGleason as a parse action to gleason and get this behavior:
    def diceGleasonParseAction(tokens):
        def diceGleason(glrhs, gllhs):
            if len(glrhs) == 0:
                pri = gllhs[0]
                sec = gllhs[2]
                #~ tot = pri + sec
                tot = str(int(pri)+int(sec))
                return [pri, sec, tot]
            elif len(glrhs) == 1:
                pri = gllhs[0]
                sec = gllhs[2]
                tot = glrhs
                return [pri, sec, tot]
            else:
                pri = glrhs[0]
                sec = glrhs[2]
                tot = gllhs
                return [pri, sec, tot]

        pri, sec, tot = diceGleason(tokens.gleason.rhs, tokens.gleason.lhs)
        # assign results names for later use
        tokens.gleason['pri'] = pri
        tokens.gleason['sec'] = sec
        tokens.gleason['tot'] = tot

    gleason.setParseAction(diceGleasonParseAction)
You just had one typo where you summed pri and sec to get tot: these are all strings, so you were adding '3' and '4' and getting '34'. Converting to ints to do the addition was all that was needed. Otherwise, I kept diceGleason verbatim internal to diceGleasonParseAction, to isolate your logic for inferring pri, sec, and tot from the mechanics of embellishing the parsed tokens with new results names. Since the parse action does not return anything new, the tokens are updated in-place, and then carried along to be used later in your output method.
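To see the string-versus-int difference in isolation, here is a tiny sketch in plain Python. The values '3' and '4' are just the example from above; parsed tokens arrive as strings like these unless you convert them yourself.

```python
pri, sec = "3", "4"  # parsed tokens are strings, not numbers

concatenated = pri + sec           # string concatenation
summed = str(int(pri) + int(sec))  # numeric addition, converted back to a string

print(concatenated)  # 34
print(summed)        # 7
```

Converting back to a string with `str(...)` keeps tot the same type as pri and sec, so the output code can treat all three the same way.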