Question

Ok, I finally got my grammar to capture all my test cases, but I have a duplicate (case 3) and a false positive (case 6, "PATTERN 5"). Here are my test cases and my desired output.

I'm still pretty new to Python (though able to teach my kids! scary!), so I'm sure there are obvious ways to solve this problem; I'm not even sure this is a pyparsing issue. Here's what my output looks like for now:

['01/01/01','S01-12345','20/111-22-1001',['GLEASON', ['5', '+', '4'], '=', '9']]
['02/02/02','S02-1234','20/111-22-1002',['GLEASON', 'SCORE', ':', ['3', '+', '3'], '=', '6']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]]
['04/17/04','S04-123','30/111-22-1004',['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]]
['05/28/05','S05-1234','20/111-22-1005',['GLEASON', 'SCORE', '7', '[', ['3', '+', '4'], ']']]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', ['4', '+', '3']]]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', 'PATTERN', '5']]
['07/22/07','S07-2749','20/111-22-1007',['GLEASON', 'SCORE', '6', '(', ['3', '+', '3'], ')']]

Here's the grammar

num = Word(nums)
arith_expr = operatorPrecedence(num,
    [
    (oneOf('-'), 1, opAssoc.RIGHT),
    (oneOf('* /'), 2, opAssoc.LEFT),
    (oneOf('+ -'), 2, opAssoc.LEFT),
    ])
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
score = (Optional(oneOf('( [')) +
         arith_expr('lhs') +
         Optional(oneOf(') ]')) +
         Optional(oneOf('= -')) +
         Optional(oneOf('( [')) +
         Optional(arith_expr('rhs')) +
         Optional(oneOf(') ]')))
gleason = Group("GLEASON" + Optional("SCORE") + Optional("GRADE") + Optional("PATTERN") + Optional(":") + score)
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")

and the output function.

lastPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            # a Gleason score showed up before any patient data
            print("bad!")
            continue
        # getParts()
        FOUT.write( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]\n".format(lastPatientData.patientData, match.gleason))

As you can see, the output isn't as good as it looks: I'm just writing to a file and faking some of the syntax. I have been struggling with how to get hold of the pyparsing intermediate results so I can work with them. Should I just write this out and run a second script that finds the duplicates?

Update, based on Paul McGuire's answer: the output of this function gets me down to one row per entry, but now I'm losing pieces of the score. Each Gleason score, intellectually, has the form primary + secondary = total. This is headed for a database, so pri, sec, and tot are separate PostgreSQL columns or, for the output of the parser, comma-separated values.

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated
            # Gleason scores for the previous one
             writeOut(accumPatientData)
        accumPatientData = (match.patientData, [])
    elif match.gleason:
        accumPatientData[1].append(match.gleason)
if accumPatientData is not None:
    writeOut(accumPatientData)

So now the output looks like this

01/01/01,S01-12345,20/111-22-1001,9
02/02/02,S02-1234,20/111-22-1002,6
03/02/03,S03-1234,31/111-22-1003,7,4+3
04/17/04,S04-123,30/111-22-1004,
05/28/05,S05-1234,20/111-22-1005,3+4
06/18/06,S06-10686,20/111-22-1006,,
07/22/07,S07-2749,20/111-22-1007,3+3

I would like to reach back in there and grab some of those lost elements, rearrange them, find the ones that are missing, and put them all back in. Something like this pseudocode:

def diceGleason(glrhs, gllhs):
    if len(glrhs) == 0:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = pri + sec
        return [pri, sec, tot]
    elif len(glrhs) == 1:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = glrhs
        return [pri, sec, tot]
    else:
        pri = glrhs[0]
        sec = glrhs[2]
        tot = gllhs
        return [pri, sec, tot]

Update 2: Ok, Paul is awesome, but I'm dumb. Having done exactly what he said, I have tried a few ways to acquire pri, sec, and tot, but I'm failing. I keep getting an error like this:

Traceback (most recent call last):
  File "Stage1.py", line 81, in <module>
    writeOut(accumPatientData)
  File "Stage1.py", line 47, in writeOut
    FOUT.write( "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n".format( pd, gleasonList))
AttributeError: 'list' object has no attribute 'pri'

These AttributeErrors are what I keep getting. Clearly I don't understand what's going on in between (Paul, I have the book, I swear it's open in front of me, and I don't understand). Here's my script. Is something in the wrong place? Am I calling the results wrong?


Solution

I didn't make a single change to your parser, but made a few changes to your post-parsing code.

You are not really getting "duplicates". The issue is that you print out the current patient data every time you see a Gleason score, and some of your patient data records include multiple Gleason score entries. If I understand what you are trying to do, here is the pseudo-code I would follow:

accumulator = None
foreach match in (patientDataExpr | gleasonScoreExpr).searchString(source):

    if it's a patientDataExpr:
        if accumulator is not None:
            # we are starting a new patient data record, print out the previous one
            print out accumulated data
        initialize new accumulator with current match and empty list for gleason data

    else if it's a gleasonScoreExpr:
        add this expression into the current accumulator

# done with the for loop, do one last printout of the accumulated data
if accumulator is not None:
    print out accumulated data

This converts to Python pretty easily:

def printOut(patientDataTuple):
    pd,gleasonList = patientDataTuple
    print( "['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]".format(
        pd, ','.join(''.join(gl.rhs) for gl in gleasonList)))

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated 
            # Gleason scores for the previous one
            printOut(accumPatientData)

        # start accumulating for a new patient data entry
        accumPatientData = (match.patientData, [])

    elif match.gleason:
        accumPatientData[1].append(match.gleason)
    #~ print match.dump()

if accumPatientData is not None:
    printOut(accumPatientData)

I don't think I'm dumping out the Gleason data correctly, but you can tune it from here, I think.
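
To see the accumulate-and-flush pattern on its own, here is a minimal sketch that swaps the pyparsing matches for plain tuples. The `events` list, the `group_scores` name, and the `'patient'` / `'gleason'` tags are stand-ins made up for illustration; only the control flow mirrors the loop above. Unlike that loop, this sketch also skips a score that arrives before any patient data, which is an assumption about what you would want.

```python
def group_scores(events):
    """Collect each patient's Gleason entries into one row.

    `events` is a hypothetical stand-in for the stream of matches from
    searchString: ('patient', data) starts a new record, and
    ('gleason', data) belongs to the most recent patient.
    """
    rows = []
    current = None  # (patient_data, [gleason entries]) being accumulated
    for kind, data in events:
        if kind == 'patient':
            if current is not None:
                rows.append(current)   # flush the previous patient
            current = (data, [])
        elif kind == 'gleason':
            if current is None:
                continue               # score with no patient yet: skip it
            current[1].append(data)
    if current is not None:
        rows.append(current)           # flush the last patient
    return rows


events = [
    ('patient', '03/02/03,S03-1234'),
    ('gleason', '4+3=7'),
    ('gleason', '7=4+3'),
    ('patient', '06/18/06,S06-10686'),
    ('gleason', '4+3'),
]
for patient, scores in group_scores(events):
    print(patient, scores)
```

Each patient comes out once, with all of its scores attached, so the "duplicate" rows disappear without a second de-duplication script.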

EDIT:

You can attach diceGleason as a parse action to gleason and get this behavior:

def diceGleasonParseAction(tokens):
    def diceGleason(glrhs,gllhs):
        if len(glrhs) == 0:
            pri = gllhs[0]
            sec = gllhs[2]
            #~ tot = pri + sec
            tot = str(int(pri)+int(sec))
            return [pri, sec, tot]
        elif len(glrhs) == 1:
            pri = gllhs[0]
            sec = gllhs[2]
            tot = glrhs
            return [pri, sec, tot]
        else:
            pri = glrhs[0]
            sec = glrhs[2]
            tot = gllhs
            return [pri, sec, tot]

    pri,sec,tot = diceGleason(tokens.gleason.rhs, tokens.gleason.lhs)

    # assign results names for later use
    tokens.gleason['pri'] = pri
    tokens.gleason['sec'] = sec
    tokens.gleason['tot'] = tot

gleason.setParseAction(diceGleasonParseAction)

You just had one typo where you summed pri and sec to get tot: these are all strings, so you were adding '3' and '4' and getting '34'. Converting to ints to do the addition was all that was needed. Otherwise, I kept diceGleason verbatim inside diceGleasonParseAction, to isolate your logic for inferring pri, sec, and tot from the mechanics of embellishing the parsed tokens with new results names. Since the parse action does not return anything new, the tokens are updated in-place and then carried along to be used later in your output method.
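
The string-versus-int difference, and one way the new names might be consumed in the output step, can be sketched without pyparsing. `SimpleNamespace` objects stand in for the parsed results here, and `format_row` is a hypothetical helper, not part of your script. Note that the Gleason scores arrive as a list, so the names have to be read from each element rather than from the list itself, which is what the `'list' object has no attribute 'pri'` error is complaining about.

```python
from types import SimpleNamespace

# Parsed tokens are strings, so + concatenates instead of adding.
pri, sec = '3', '4'
assert pri + sec == '34'
assert str(int(pri) + int(sec)) == '7'


def format_row(patient, gleason_list):
    """Build one CSV line per Gleason entry for a patient.

    `patient` and each item of `gleason_list` are stand-ins for the
    pyparsing results; only attribute access is assumed.
    """
    lines = []
    for gl in gleason_list:   # read pri/sec/tot from each entry, not the list
        lines.append("{0.accDate},{0.accNum},{0.patientNum},"
                     "{1.pri},{1.sec},{1.tot}".format(patient, gl))
    return lines


patient = SimpleNamespace(accDate='01/01/01', accNum='S01-12345',
                          patientNum='20/111-22-1001')
scores = [SimpleNamespace(pri='5', sec='4', tot='9')]
print(format_row(patient, scores))
```

Whether you want one line per Gleason entry or a single line per patient is your call; the point is only that `.pri`, `.sec`, and `.tot` live on the individual entries.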

Licensed under: CC-BY-SA with attribution